Fast list generation from a narrow data.frame - r

I have a data.frame in narrow (long) format like this:
test_data <- data.frame(id = rep(1:200, 50),
                        variable = sample(LETTERS, 10000, TRUE),
                        value = sample(letters, 10000, TRUE),
                        stringsAsFactors = FALSE)
I want to get a list containing a list for each id, with each variable as a vector inside (something like a simple JSON structure):
list("1"=list(A=c("a","b"), B=character(), C="v" ...
My code:
return_data <-
  sapply(unique(test_data$id), function(r)
    sapply(LETTERS, function(q)
      test_data[test_data$id == r & test_data$variable == q, "value"],
      USE.NAMES = TRUE, simplify = FALSE),
    USE.NAMES = TRUE, simplify = FALSE)
It works, but it is too slow on large samples. I've rewritten it with data.table, but it is still slow. I got some improvement with parSapply, but I believe there should be a more efficient algorithm...

You can use split with variable converted to a factor, thanks to the drop = FALSE behavior (the default, but passed explicitly here for readability). With magrittr:
library(magrittr)
res = test_data %>%
  transform(variable = factor(variable)) %>%
  split(.["id"]) %>%
  lapply(function(x) split(x$value, x["variable"], drop = FALSE))
all.equal(unname(res), return_data) # TRUE
The same without magrittr:
new_test_data = transform(test_data, variable = factor(variable))
sp_id = split(new_test_data, new_test_data["id"])
res2 = lapply(sp_id, function(x) split(x$value, x["variable"], drop = FALSE))
all.equal(unname(res2), return_data) # TRUE
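Since the question mentions a data.table attempt, here is a minimal alternative sketch (my addition, not part of the original answer) using data.table's split method; the name res3 is arbitrary:
library(data.table)
dt <- as.data.table(test_data)
# fixing the levels keeps empty variables (e.g. B = character()) in the output
dt[, variable := factor(variable, levels = LETTERS)]
res3 <- lapply(split(dt, by = "id", keep.by = FALSE),
               function(x) split(x$value, x$variable, drop = FALSE))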

Related

R, How to refer to variable name as string, in function that uses arrow::open_dataset internally

Trying to create a function that will compute the average of some variable whose name is provided as an argument. For instance:
mean_of_var <- function(var){
  open_dataset('myfile') %>%
    summarise(meanB = mean(get(var), na.rm = T),
              medianB = median(get(var), na.rm = T)) %>%
    collect()
}
mean_of_var('myvar')
The main problem is that arrow::open_dataset does not support the get() function, so I get the error message:
Error: Expression mean(get(var), na.rm = T) not supported in Arrow
Call collect() first to pull data into R.
Is there a way to write a function like that, while keeping the use of open_dataset('myfile')?
The dplyr verbs used in arrow rely on "tidy evaluation". You therefore need to "embrace" your variable names within your function:
library(arrow)
library(dplyr)
## create a parquet file to read with `open_dataset()`
pq_file <- tempfile(fileext = ".parquet")
dd <- tibble::tibble(
  col1 = rnorm(100),
  col2 = rnorm(100),
  col3 = rnorm(100)
)
write_parquet(dd, sink = pq_file)
mean_of_var <- function(var) {
  open_dataset(pq_file) %>%
    summarize(
      meanB = mean({{ var }}, na.rm = TRUE),
      medianB = median({{ var }}, na.rm = TRUE)
    ) %>%
    collect()
}
To use the function:
mean_of_var(col2)
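If you need to keep passing the column name as a string, as in the original mean_of_var('myvar') call, a hedged sketch using rlang injection should also work (mean_of_var_chr is an assumed name; the !! plus sym() pattern is standard rlang, applied here on the assumption that arrow's dplyr bindings honor it):
mean_of_var_chr <- function(var) {
  open_dataset(pq_file) %>%
    summarize(
      # turn the string into a symbol and inject it into the expression
      meanB = mean(!!rlang::sym(var), na.rm = TRUE),
      medianB = median(!!rlang::sym(var), na.rm = TRUE)
    ) %>%
    collect()
}
mean_of_var_chr("col2")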

Speeding up dplyr pipe including checks with mutate_if and if_else on larger tables

I wrote some code to perform oversampling: I replicate the observations in a data.frame and add noise to the replicates, so they are no longer exactly the same. I'm quite happy that it now works as intended, but... it is too slow. I'm just learning dplyr and have no clue about data.table, but I hope there is a way to improve my function. I'm running this code in a function over hundreds of data.frames, each of which may contain about 10,000 columns and 400 rows.
This is some toy data:
library(tidyverse)
train_set1 <- rep(0, 300)
train_set2 <- rep("Factor1", 300)
train_set3 <- data.frame(replicate(1000, sample(0:1, 300, rep = TRUE)))
train_set <- cbind(train_set1, train_set2, train_set3)
row.names(train_set) <- paste("Sample", 1:nrow(train_set), sep = "_")
This is the code to replicate each row a given number of times and a function to determine whether the added noise later will be positive or negative:
# replicate each row so it appears three times; the two added copies'
# row.names gain a "."
train_oversampled <- train_set[rep(seq_len(nrow(train_set)), each = 3), ]
# create a flip function
flip <- function() {
  sample(c(-1, 1), 1)
}
In the relevant "too slow" piece of code, I subset the row.names for the added "." to filter for the replicates. Then I select only the numeric columns, go through them row by row, and leave values untouched if they are 0; otherwise a certain amount of noise is added (here +/- 1 %). Later on, I combine this data set with the original data set to get my oversampled data.frame.
# add percentage of noise to non-zero values in numerical columns
noised_copies <- train_oversampled %>%
  rownames_to_column(var = "rowname") %>%
  filter(grepl("\\.", rowname)) %>%
  rowwise() %>%
  mutate_if(~ is.numeric(.), ~ if_else(. == 0, 0, . + (. * flip() * 0.01))) %>%
  ungroup() %>%
  column_to_rownames(var = "rowname")
# combine original and oversampled, noised data set
train_noised <- rbind(noised_copies, train_set)
I assume there are faster ways using e.g. data.table, but it was already tough work to get this code running and I have no idea how to improve its performance.
EDIT:
The solution works perfectly fine with fixed values, but when called within a for loop I receive "Error in paste(Sample, n, sep = ".") : object 'Sample' not found".
Code to replicate:
library(data.table)
train_set <- data.frame(
  x = rep(0, 10),
  y = 0:9,
  z = rep("Factor1", 10))
# changing the row name to avoid confusion with "Sample"
row.names(train_set) <- paste("Observation", 1:nrow(train_set), sep = "_")
train_list <- list(aa = train_set, bb = train_set, cc = train_set)
for (current_table in train_list) {
  setDT(current_table, keep.rownames = "Sample")
  cols <- names(current_table)[sapply(current_table, is.numeric)]
  noised_copies <- lapply(c(1, 2), function(n) {
    copy(current_table)[,
      c("Sample", cols) := c(.(paste(Sample, n, sep = ".")),
                             .SD * sample(c(-1.01, 1.01), .N * ncol(.SD), TRUE)),
      .SDcols = cols]
  })
  train_noised <- rbindlist(c(noised_copies, list(train_set)), use.names = FALSE)
  # As this is an example, I did not write anything to actually
  # store the results, so I have to remove the object
  rm(train_noised)
}
Any ideas why the column Sample can't be found now?
Here is a more vectorized approach using data.table:
library(data.table)
setDT(train_set, keep.rownames = "Sample")
cols <- names(train_set)[sapply(train_set, is.numeric)]
noised_copies <- lapply(c(1, 2), function(n) {
  copy(train_set)[,
    c("Sample", cols) := c(.(paste(Sample, n, sep = ".")),
                           .SD * sample(c(-1.01, 1.01), .N * ncol(.SD), TRUE)),
    .SDcols = cols]
})
train_noised <- rbindlist(c(noised_copies, list(train_set)), use.names = FALSE)
With data.table version >= 1.12.9, you can pass is.numeric directly to the .SDcols argument, and there may be a shorter way (e.g. (.SD) or names(.SD)) to specify the left-hand side of :=.
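A sketch of that variant (my addition, assuming data.table >= 1.12.9; otherwise unchanged from the answer above):
noised_copies <- lapply(c(1, 2), function(n) {
  x <- copy(train_set)
  x[, Sample := paste(Sample, n, sep = ".")]
  # .SDcols = is.numeric selects the numeric columns directly
  x[, (cols) := .SD * sample(c(-1.01, 1.01), .N * ncol(.SD), TRUE),
    .SDcols = is.numeric]
  x
})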
To address the OP's updated post: the issue is that although each data.frame within the list is converted to a data.table, train_list itself is not updated. You can update the list (re-assigning the converted tables with lapply) before the for loop:
library(data.table)
train_set <- data.frame(
  x = rep(0, 10),
  y = 0:9,
  z = rep("Factor1", 10))
# changing the row name to avoid confusion with "Sample"
row.names(train_set) <- paste("Observation", 1:nrow(train_set), sep = "_")
train_list <- list(aa = train_set, bb = copy(train_set), cc = copy(train_set))
train_list <- lapply(train_list, setDT, keep.rownames = "Sample")
for (current_table in train_list) {
  cols <- names(current_table)[sapply(current_table, is.numeric)]
  noised_copies <- lapply(c(1, 2), function(n) {
    copy(current_table)[,
      c("Sample", cols) := c(.(paste(Sample, n, sep = ".")),
                             .SD * sample(c(-1.01, 1.01), .N * ncol(.SD), TRUE)),
      .SDcols = cols]
  })
  # bind the noised copies to the current table (not the whole list)
  train_noised <- rbindlist(c(noised_copies, list(current_table)), use.names = FALSE)
  # As this is an example, I did not write anything to actually
  # store the results, so I have to remove the object
  rm(train_noised)
}
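As a hedged alternative sketch (my addition): the same processing without a for loop, using lapply and collecting the results in a named list (train_noised_list is an assumed name) so nothing is silently overwritten:
train_noised_list <- lapply(train_list, function(current_table) {
  cols <- names(current_table)[sapply(current_table, is.numeric)]
  noised_copies <- lapply(c(1, 2), function(n) {
    copy(current_table)[,
      c("Sample", cols) := c(.(paste(Sample, n, sep = ".")),
                             .SD * sample(c(-1.01, 1.01), .N * ncol(.SD), TRUE)),
      .SDcols = cols]
  })
  rbindlist(c(noised_copies, list(current_table)), use.names = FALSE)
})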

Writing a kind of for-loop using dplyr in R

Is it possible to use a kind of for loop inside dplyr syntax? I'm using the following code to count observations with MAP <= 99, MAP <= 98, and so on down to MAP <= 1. This is not very efficient, so I'd like to repeat the computation for MAP <= 100:1.
duur2_vs_diepte <- data_blood_pressure %>%
  summarise(
    duur_tm99_2 = (sum(MAP <= 99))^2,
    duur_tm98_2 = (sum(MAP <= 98))^2,
    duur_tm97_2 = (sum(MAP <= 97))^2,
    .......
    duur_tm4_2 = (sum(MAP <= 4))^2,
    duur_tm3_2 = (sum(MAP <= 3))^2,
    duur_tm2_2 = (sum(MAP <= 2))^2,
    duur_tm1_2 = (sum(MAP <= 1))^2
  )
This may work for you:
# a helping function to create each column
create_columns <- function(x, mat) {
  dt <- mat %>%
    filter(MAP <= x) %>%
    summarise(n()^2)  # squared count of rows with MAP <= x, as in the question
  names(dt) <- paste0("duur_tm", x, "_2")
  dt
}
# get all results together
bind_cols(lapply(100:1, create_columns, data_blood_pressure))
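For comparison, a base-R sketch (my addition, not part of the answer) that computes all 100 squared counts in one pass; counts_sq is an assumed name:
counts_sq <- sapply(100:1, function(x) sum(data_blood_pressure$MAP <= x, na.rm = TRUE)^2)
names(counts_sq) <- paste0("duur_tm", 100:1, "_2")
duur2_vs_diepte <- as.data.frame(t(counts_sq))  # one row, 100 columns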

map + pmap, cannot find variables

I am trying to collate results from a simulation study using dplyr and purrr. My results are saved as a list of data frames with the results from several different classification algorithms, and I'm trying to use purrr and dplyr to summarize these results.
I'm trying to calculate
- number of objects assigned to each cluster
- number of objects in the cluster that actually belong to the cluster
- number of true positives, false positives, false negatives, and true negatives using 3 different algorithms (KEEP1 - KEEP3)
- for 2 of the algorithms, I have access to a probability of being in the cluster, so I can compare this to alternate choices of alpha - and so I can calculate true positives etc. using a different choice of alpha.
I found this: https://github.com/tidyverse/dplyr/issues/3101, which I used successfully on a single element of the list to get exactly what I wanted:
f <- function(.x, .y) {
  sum(.x & .y)
}
actions <- list(
  .vars = lst(
    c('correct'),
    c('KEEP1', 'KEEP2', 'KEEP3'),
    c('pval1', 'pval2')
  ),
  .funs = lst(
    funs(Nk = length, N_correct = sum),
    funs(
      TP1 = f(., .y = correct),
      FN1 = f(!(.), .y = correct),
      TN1 = f(!(.), .y = !(correct)),
      FP1 = f(., .y = !(correct))
    ),
    funs(
      TP2 = f((. < alpha0), .y = correct),
      FN2 = f(!(. < alpha0), .y = correct),
      TN2 = f(!(. < alpha0), .y = !(correct)),
      FP2 = f((. < alpha0), .y = !(correct))
    )
  )
)
reproducible_data <- replicate(
  2,
  data_frame(
    k = factor(rep(1:10, each = 20)),  # group/category
    correct = sample(x = c(TRUE, FALSE), 10 * 20, replace = TRUE, prob = c(.8, .2)),
    pval1 = rbeta(10 * 20, 1, 10),
    pval2 = rbeta(10 * 20, 1, 10),
    KEEP1 = pval1 < 0.05,
    KEEP2 = pval2 < 0.05,
    KEEP3 = runif(10 * 20) > .2,
    alpha0 = 0.05,
    alpha = 0.05 / 20  # divided by no. of objects in each group (k)
  ),
  simplify = FALSE)
# works
df1 <- reproducible_data[[1]]
pmap(actions, ~ df1 %>% group_by(k) %>% summarize_at(.x, .y)) %>%
  reduce(inner_join, by = 'k')
Now, I want to use map to apply this to the entire list. However, I can no longer access the variable "correct" (it hasn't gotten far enough to miss alpha or alpha0, but presumably the same issue will occur there). I'm still learning dplyr/purrr, and my experimenting hasn't proved useful.
# does not work
out_summary <- map(
  reproducible_data,
  pmap(actions, ~ as_tibble(.) %>% group_by("k") %>% summarize_at(.x, .y)) %>%
    reduce(inner_join, by = 'k')
)
# this doesn't either
out_summary <- map(
  reproducible_data,
  pmap(actions, ~ as_tibble(.) %>% group_by("k") %>%
         summarize_at(.x, .y, alpha = alpha, alpha0 = alpha0, correct = correct)) %>%
    reduce(inner_join, by = 'k')
)
Within map, I don't see the variable 'k' in group_by(k) unless it is quoted as group_by('k'), but I did not need to quote it when I used pmap alone. I've tried various ways to pass the correct variables to these functions, but I'm still learning dplyr and purrr and haven't succeeded yet.
One more note: the actual data is stored as a regular data frame, so I need as_tibble() in the pmap function. I was running into some different errors when I removed it in this example, so I opted to add it back so I would get the same issues. Thanks!
Try this:
map(
  reproducible_data,
  function(df1) {
    pmap(actions, ~ df1 %>%
           as_tibble() %>%
           group_by(k) %>%
           summarize_at(.x, .y)) %>%
      reduce(inner_join, by = "k")
  }
)
I think your arguments might get mixed up when using map and pmap at the same time, so I used the function syntax for map to define df1 explicitly. The rest of it looks OK (you could also switch from pmap to pmap_df to return a data frame directly; the structure of the nested list is ugly otherwise, and pmap_df is the easiest way to make it pretty). Let me know if it's not the expected output. 👍
Regarding group_by("k") vs. group_by(k): writing group_by("k") actually creates a variable filled with the character "k", then groups by that. That will get your code to run, but it won't do what you want. Sometimes that kind of problem is really caused by an error that occurs a line or two earlier (or, with dplyr, a pipe or two earlier). In this case, map wasn't passing df1 where you needed it.
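A tiny illustration of that pitfall (my toy example, not from the thread):
library(dplyr)
d <- tibble(k = c(1, 1, 2), v = 1:3)
d %>% group_by(k) %>% summarize(n = n())    # two groups, as intended
d %>% group_by("k") %>% summarize(n = n())  # one group: a constant column "k" was created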

R validate package. Create indicator object for arbitrary column

I'm new to R and I'm running into a problem that leaves me puzzled about how the language works. There is a package, "validate", that can create objects you can use to check that your data are as expected.
Testing it out on some toy data, I found that while the following code worked as expected:
library(validate)
I <- indicator(
  cnt_misng = number_missing(x)
  , sum  = sum(x, na.rm = TRUE)
  , min  = min(x, na.rm = TRUE)
  , mean = mean(x, na.rm = TRUE)
  , max  = max(x, na.rm = TRUE)
)
dat <- data.frame(x = 1:4, y = c(NA, 11, 7, 8), z = c(NA, 2, 0, NA))
C <- confront(dat, I)
values(C)
However, I found that I could not create a function that would return an indicator object for any arbitrary column of the data frame. This was my failed attempt:
check_values <- function(data, x){
  print(x)
  I <- indicator(
    cnt_misng = number_missing(eval(x))
    , max = max(eval(x), na.rm = TRUE)
  )
  C <- confront(data, I)  # confront the data argument, not the global df
  return(C)
}
df <- data.frame(A = 1:4, B = c(NA, 11, 7, 8), C = c(NA, 2, 0, NA))
C <- check_values(df,'B')
values(C)
If I have a large dataset, I'd like to be able to loop through a list of columns and get an identically formatted report for each one. At this point, I'll probably give up on this package and find a more direct way to do that. However, I am still curious how this could be made to work. It seems like there should be a way to functionalize the creation of this indicator object so I can reuse the code to check the same stats for any arbitrary column of a data frame.
Any ideas?
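One hedged sketch of an approach (my addition, not from the original thread): since indicator() captures unevaluated expressions, you can build the expressions as text with the column name substituted in, then parse and evaluate:
library(validate)
check_values <- function(data, col) {
  # construct the indicator call as text, then parse and evaluate it
  I <- eval(parse(text = sprintf(
    "indicator(cnt_misng = number_missing(%s), max = max(%s, na.rm = TRUE))",
    col, col
  )))
  confront(data, I)
}
df <- data.frame(A = 1:4, B = c(NA, 11, 7, 8), C = c(NA, 2, 0, NA))
values(check_values(df, "B"))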
