How do I use arrange inside a function?

I am trying to create a user-defined function which carries out some data transformations.
Mock data:
library(tidyverse)
set.seed(1)
sampledata_a <- data.frame(
patientid = sample(1:100),
servicetype = sample(c("service1", "service2", "service3", "service4", "service5"), 100, replace=TRUE),
date = sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 100)
)
sampledata_b <- data.frame(
patientid = sample(1:100),
servicetype = sample(c("service6", "service7", "service8", "service9", "service10"), 100, replace=TRUE),
date = sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 100)
)
sampledata1 <- rbind(sampledata_a, sampledata_b)
User-defined function:
get_most_recent_and_unique <- function(inputdata, groupbyvar, uniquevar, datevar) {
# first selects the most recent observation for each unique variable combination
outputdata <- inputdata %>%
distinct() %>%
arrange(groupbyvar, uniquevar, desc(datevar)) %>%
mutate(orderkey = paste0(groupbyvar, uniquevar, sep = "")) %>%
group_by(orderkey) %>%
do(head(., n=1)) %>%
ungroup() %>%
arrange(groupbyvar, desc(datevar), uniquevar)
# then transpose from long to wide, and unite variables other than first variable into one
outputdata <- outputdata %>%
select(groupbyvar, uniquevar) %>%
group_by(groupbyvar) %>%
mutate(pos=1:n()) %>%
spread(pos, uniquevar) %>%
unite(uniquevar, -groupbyvar, sep=" / ")
return(outputdata)
}
When running the function as below:
outputdata <- get_most_recent_and_unique(sampledata1, "patientid", "servicetype", "date")
I get the following error message:
Error in arrange_impl(.data, dots) :
incorrect size (1) at position 1, expecting : 100
However, the same code works fine outside the user-defined function. Can anyone tell me what is wrong?
testoutputdata <- sampledata1 %>%
distinct() %>%
arrange(patientid, servicetype, desc(date)) %>%
mutate(orderkey = paste0(patientid, servicetype, sep = "")) %>%
group_by(orderkey) %>%
do(head(., n=1)) %>%
ungroup() %>%
arrange(patientid, desc(date), servicetype)
testoutputdata <- testoutputdata %>%
select(patientid, servicetype) %>%
group_by(patientid) %>%
mutate(pos=1:n()) %>%
spread(pos, servicetype) %>%
unite(servicetype, -patientid, sep=" / ")

Try this:
get_most_recent_and_unique <- function(inputdata, groupbyvar, uniquevar, datevar) {
groupbyvar <- enquo(groupbyvar)
uniquevar <- enquo(uniquevar)
datevar <- enquo(datevar)
# first selects the most recent observation for each unique variable combination
outputdata <- inputdata %>%
distinct() %>%
arrange(!! groupbyvar, !! uniquevar, desc(!! datevar)) %>%
mutate(orderkey := paste0(!! groupbyvar, !! uniquevar, sep = "")) %>%
group_by(orderkey) %>%
do(head(., n=1)) %>%
ungroup() %>%
arrange(!! groupbyvar, desc(!! datevar), !! uniquevar)
# then transpose from long to wide, and unite variables other than first variable into one
outputdata <- outputdata %>%
select(!! groupbyvar, !! uniquevar) %>%
group_by(!! groupbyvar) %>%
mutate(pos=1:n()) %>%
spread(pos, !! uniquevar) %>%
unite(!! uniquevar, -!! groupbyvar, sep=" / ")
return(outputdata)
}
outputdata <- get_most_recent_and_unique(sampledata1, patientid, servicetype, date) # No quotation with arguments!
Here is the output:
  patientid servicetype
      <int> <chr>
1         1 service7 / service3
2         2 service10 / service1
3         3 service4 / service9
4         4 service8 / service3
5         5 service6 / service1
It seems to match your expectations when I compare them:
all.equal(outputdata, testoutputdata)
[1] TRUE
Note that you shouldn't quote the arguments when calling the function, i.e. outputdata <- get_most_recent_and_unique(sampledata1, patientid, servicetype, date) will work, while outputdata <- get_most_recent_and_unique(sampledata1, "patientid", "servicetype", "date") won't.
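If you would rather keep passing the column names as strings, as in the original call, one option is the .data pronoun. In the original function, arrange(groupbyvar) sees the length-1 string "patientid" rather than the column, which is why the size check fails. The following is only a minimal sketch, assuming a dplyr version that supports .data[[ ]]; sort_recent_first and toy are hypothetical names used for illustration.

```r
library(dplyr)

# Hypothetical helper: sort by a grouping column, then newest date first.
# Column names arrive as strings and are looked up with the .data pronoun.
sort_recent_first <- function(inputdata, groupbyvar, datevar) {
  inputdata %>%
    arrange(.data[[groupbyvar]], desc(.data[[datevar]]))
}

# Toy data, made up for this example
toy <- data.frame(
  id   = c(2, 1, 1),
  date = as.Date(c("1999-03-01", "1999-01-01", "1999-02-01"))
)

sorted <- sort_recent_first(toy, "id", "date")
print(sorted)
```

The same lookup should work in the other verbs of the function (mutate(), group_by()), so the strings never have to be unquoted by hand.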

Related

Dynamically change the column name created using summarise() and complete()

I'm trying to dynamically create an extra column. The first piece of code works as I want it to:
library(dplyr)
library(tidyr)
set.seed(1)
df <- data.frame(animals = sample(c('dog', 'cat', 'rat'), 100, replace = T))
my_fun <- function(data, column_name){
data %>% group_by(animals) %>%
summarise(!!column_name := n())
}
my_fun(df, 'frequency')
Here I also use the complete() function, and it doesn't work:
library(dplyr)
library(tidyr)
set.seed(1)
df <- data.frame(animals = sample(c('dog', 'cat', 'rat'), 100, replace = T))
my_fun <- function(data, column_name){
data %>% group_by(animals) %>%
summarise(!!column_name := n())%>%
ungroup() %>%
complete(animals = c('dog', 'cat', 'rat', 'bat'),
fill = list(!!column_name := 0))
}
my_fun(df, 'frequency')
The list() function doesn't seem to accept !!column_name :=.
Is there something I can do to make this work? Basically, I want the second piece of code to output:
animals frequency
bat             0
cat            38
dog            27
rat            35

You could keep the fill argument of complete() as the default (which will give you the missing values as NA) and subsequently replace them with 0:
my_fun <- function(data, column_name){
data %>%
group_by(animals) %>%
summarise(!!column_name := n())%>%
ungroup() %>%
complete(animals = c('dog', 'cat', 'rat', 'bat')) %>%
mutate_all(~replace(., is.na(.), 0))
}
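Another way to keep the fill argument is to build the named list with base R's setNames(), which needs no !! or := at all. This is a sketch rather than the answerer's method; my_fun_fill and toy are hypothetical names, and 0L is used so the fill value matches the integer counts returned by n().

```r
library(dplyr)
library(tidyr)

# Hypothetical variant of my_fun(): the fill list is named programmatically.
my_fun_fill <- function(data, column_name) {
  data %>%
    group_by(animals) %>%
    summarise(!!column_name := n()) %>%
    ungroup() %>%
    complete(animals = c("dog", "cat", "rat", "bat"),
             fill = setNames(list(0L), column_name))
}

# Small deterministic data set, made up for this example
toy <- data.frame(animals = c("dog", "dog", "cat", "rat"),
                  stringsAsFactors = FALSE)

out <- my_fun_fill(toy, "frequency")
print(out)
```

Unlike mutate_all(), this only touches the column named in column_name, so genuine NA values elsewhere are left alone.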

Assigning floor date in groupby fails

This is my code, and I have a problem with group_by:
library(dplyr)
library(lubridate)
library(openxlsx) # provides read.xlsx()
df <- read.xlsx("Data.xlsx", sheet = "Sector-STOXX600", startRow = 2,colNames = TRUE, detectDates = TRUE, skipEmptyRows = FALSE)
df[2:19] <- data.matrix(df[2:19])
percent_change2 <- function(x)last(x)/first(x) - 1
monthly_return <- df %>%
group_by(gr = floor_date(Date, unit = "month")) %>%
summarize_at(vars(-Date, -gr), percent_change2) %>%
ungroup() %>%
select(-gr) %>%
as.matrix()
I get this error:
"Error in is_character(x) : object 'gr' not found"
Here is a sample of the dataset:
Date .SXQR .SXTR .SXNR .SXMR .SXAR .SX3R .SX6R .SXFR .SXOR .SXDR .SX4R .SXRR .SXER
1 2000-01-03 364.94 223.93 489.04 586.38 306.56 246.81 385.36 403.82 283.78 455.39 427.43 498.08 457.57
2 2000-01-04 345.04 218.90 474.05 566.15 301.13 239.24 374.64 390.41 275.93 434.92 414.10 476.17 435.72
UPDATE
volatility_function<- function(x)sqrt(252) * sd(diff(log(x))) * 100
annualized_volatility <- df %>%
mutate(Date=ymd(Date)) %>%
group_by(gr = floor_date(Date, unit = "year")) %>%
select(gr,everything()) %>%
summarize_at(vars(-Date, -gr), volatility_function) %>%
ungroup() %>% select(-gr) %>%
as.matrix()
head(annualized_volatility,5)
I tried what #NeslonGon told me to do; however, I now get the same error in another function. What should I do?

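The idea is that we don't need to exclude the grouping variable in summarise_at(): grouping variables are left out of the selection automatically, so referring to gr inside vars() fails, and only Date has to be dropped. The select and mutate calls can be skipped; they're only there for convenience.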
df %>%
mutate(Date=ymd(Date)) %>%
group_by(gr = floor_date(Date, unit = "month")) %>%
select(gr,everything()) %>%
summarize_at(vars(-Date), percent_change2) %>%
ungroup() %>%
select(-gr) %>%
as.matrix()
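The same pattern on a tiny made-up data set shows that vars(-Date) is enough. This is a minimal sketch; prices and the column A are hypothetical, and percent_change mirrors the percent_change2 helper from the question.

```r
library(dplyr)
library(lubridate)

# Same idea as percent_change2 in the question
percent_change <- function(x) last(x) / first(x) - 1

# Made-up prices: two observations in each of two months
prices <- data.frame(
  Date = as.Date(c("2000-01-03", "2000-01-31", "2000-02-01", "2000-02-29")),
  A    = c(100, 110, 200, 100)
)

monthly <- prices %>%
  group_by(gr = floor_date(Date, unit = "month")) %>%
  summarize_at(vars(-Date), percent_change) %>%  # gr is excluded automatically
  ungroup()

print(monthly)
```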

furrr package in R doesn't keep spreading the jobs across all cores?

My task is to compute cosine dissimilarities.
Given a data frame of user observations, I compute the cosine dissimilarity between each pair of rows.
Long story short, I am using furrr::future_map2_dfr to spread the calculations across all the cores I have.
For some reason, when some cores are free while others are working hard, the remaining work is not redistributed to the free cores.
For example:
Here is the starting point (screenshot not preserved):
Now it's in the middle of the calculation (screenshot not preserved):
Why don't cores 1, 2, 5, 6, 8, 11, 12, 15 participate and share the remaining jobs?
The same happens with other calculations.
Am I missing any furrr settings that would change this behavior?
P.S.
Right now 5 cores are working "hard", and for some reason furrr doesn't spread their work across all 16 cores to make it faster.
Functions:
dissimilarity_wrapper <- function(n_users,
train_data,
train_data_std,
test_data,
std_thresh = 0.5) {
# NOTE:
# n_users must be set to maximum users in order to make this function
# work properly.
# Generating the options:
user_combinations <- expand.grid(i = seq_len(n_users),
j = seq_len(n_users))
plan(strategy = multicore)
expand_grid_options <- furrr::future_map2_dfr(.x = user_combinations$i,
.y = user_combinations$j,
function(x, y) {
expand.grid(test_idx = which(test_data$user_id == x),
train_idx = which(train_data$user_id == y))})
drop <- c("user_id", "row_num",
"obs_id", "scroll_id",
"time_stamp", "seq_label",
"scroll_length")
test <- test_data[expand_grid_options$test_idx, !names(test_data) %in% drop]
train <- train_data[expand_grid_options$train_idx, !names(train_data) %in% drop]
train_std <- train_data_std[expand_grid_options$train_idx, ]
# Calculate different D's:
D_manhattan_scaled <- (abs(test - train) / train_std) %>% rowSums()
D_cosinus <- 1 - (rowSums(test * train) / (sqrt(rowSums(test^2) * rowSums(train^2))))
train_std[train_std < std_thresh] <- 1
D_manhattan_scaled_adj_std <- (abs(test - train) / train_std) %>% rowSums()
D_manhattan <- (abs(test - train)) %>% rowSums()
return(expand_grid_options %>%
dplyr::mutate(
D_manhattan_scaled = D_manhattan_scaled,
D_cosinus = D_cosinus,
D_manhattan_scaled_adj_std = D_manhattan_scaled_adj_std,
D_manhattan = D_manhattan,
isSame = test_data[test_idx, ]$user_id == train_data[train_idx, ]$user_id))
}
train_test_std_split <- function(data,
train_size,
test_size,
feature_selection) {
train_set <- data %>%
dplyr::ungroup() %>%
dplyr::arrange(time_stamp) %>%
dplyr::group_by(user_id) %>%
dplyr::filter(row_number() <= train_size) %>%
dplyr::ungroup()
if (length(feature_selection) > 1) {
# Manual:
# scaling_param_est <- scale_param_est_total_UG
scaling_param_est <- train_set %>%
dplyr::group_by(user_id) %>%
dplyr::summarize_at(vars(feature_selection), funs(mean, sd))
} else if (length(feature_selection) == 1) {
scaling_param_est <- train_set %>%
dplyr::group_by(user_id) %>%
dplyr::summarize_at(vars(feature_selection), funs(mean, sd)) %>%
dplyr::rename_at(vars("mean", "sd"),
funs(paste(feature_selection, ., sep = "_")))
}
train_set <- train_set %>%
dplyr::group_by(user_id) %>%
dplyr::mutate_at(vars(feature_selection), scale) %>%
data.table::as.data.table() %>%
dplyr::ungroup() %>%
dplyr::as_tibble() %>%
dplyr::arrange(time_stamp)
train_set_std <- train_set %>%
dplyr::left_join(train_set %>%
dplyr::group_by(user_id) %>%
dplyr::summarize_at(feature_selection, sd) %>%
dplyr::rename_at(vars(-"user_id"),
funs(paste0(feature_selection, "_sd"))), by = "user_id") %>%
dplyr::ungroup() %>%
dplyr::arrange(time_stamp) %>%
dplyr::select(matches("_sd"))
test_set_unscaled <- data %>%
dplyr::ungroup() %>%
dplyr::arrange(time_stamp) %>%
dplyr::filter(!(obs_id %in% train_set$obs_id)) %>%
dplyr::group_by(user_id) %>%
dplyr::filter(row_number() <= test_size) %>%
dplyr::ungroup()
# Manual:
# test_set_joined_with_scaling_params <- cbind(test_set_unscaled, scaling_param_est)
test_set_unscaled_joined_with_scaling_params <- test_set_unscaled %>%
dplyr::left_join(scaling_param_est, by = "user_id")
test_set_unscaled_joined_with_scaling_params[, feature_selection] <-
(test_set_unscaled_joined_with_scaling_params[, feature_selection] -
test_set_unscaled_joined_with_scaling_params[, paste0(feature_selection, "_mean")]) /
test_set_unscaled_joined_with_scaling_params[, paste0(feature_selection, "_sd")]
test_set <- test_set_unscaled_joined_with_scaling_params %>%
dplyr::select(user_id, obs_id, scroll_id,
time_stamp, row_num, scroll_length,
feature_selection)
# Validate:
# intersect(unique(test_set$obs_id), unique(train_set$obs_id))
# compute_std <- train_set %>%
# dplyr::group_by(user_id) %>%
# dplyr::select(-row_num) %>%
# dplyr::rename_at(vars(-user_id, -obs_id, -scroll_id,
# -time_stamp, -scroll_length),
# funs(paste(., "std", sep = "_"))) %>%
# dplyr::summarize_at(vars(matches("_std$")), funs(sd)) %>%
# dplyr::ungroup()
return(list("train_set" = train_set,
"train_set_std" = train_set_std,
"test_set" = test_set,
"test_set_unscaled" = test_set_unscaled))
}
build_dissimilarity_rank <- function(n_users,
train_set,
train_set_std,
test_set,
D_type = "D_cosinus") {
return(dissimilarity_wrapper(n_users, train_set, train_set_std, test_set) %>%
dplyr::mutate(train_user_id = train_set[train_idx, ]$user_id,
test_user_id = test_set[test_idx, ]$user_id) %>%
dplyr::select(test_idx,
train_user_id,
test_user_id,
train_idx,
D_manhattan_scaled,
D_cosinus,
D_manhattan_scaled_adj_std,
D_manhattan,
isSame) %>%
dplyr::group_by(test_idx, train_user_id) %>%
dplyr::arrange(train_user_id, !!rlang::sym(D_type)) %>%
dplyr::mutate(D_manhattan_rank = rank(D_manhattan),
D_manhattan_scaled_rank = rank(D_manhattan_scaled, ties.method = "first"),
D_cosinus_rank = rank(D_cosinus, ties.method = "first")) %>%
dplyr::ungroup())
}
build_param_est <- function(dissimilarity_rank,
K,
D_type_rank = "D_manhattan_scaled") {
return(dissimilarity_rank %>%
dplyr::filter(isSame, (!!rlang::sym(paste0(D_type_rank, "_rank"))) == K) %>%
dplyr::group_by(train_user_id) %>%
dplyr::summarise_at(vars(D_manhattan_scaled,
D_cosinus,
D_manhattan_scaled_adj_std,
D_manhattan),
funs(mean, median, sd, quantile(., probs = .9))) %>%
dplyr::rename_at(vars(matches("_quantile")),
funs(str_replace(., "_quantile", "_percentile_90"))) %>%
dplyr::rename_at(vars(matches("_sd")),
funs(str_replace(., "_sd", "_std")))
)
}
build_dissimilarity_table <- function(dissimilarity_rank,
param_est,
K,
i,
D_type_rank = "D_manhattan_scaled",
D_s = c("D_manhattan_scaled",
"D_cosinus",
"D_manhattan_scaled_adj_std",
"D_manhattan")) {
dissimilarity_table <- dissimilarity_rank %>%
dplyr::filter(isSame, (!!rlang::sym(paste0(D_type_rank, "_rank"))) == K) %>%
dplyr::left_join(param_est, by = c("train_user_id")) %>%
dplyr::ungroup()
dissimilarity_table[paste0(D_s, "_norm_standard")] <-
(dissimilarity_table[D_s] - dissimilarity_table[paste0(D_s, "_mean")]) /
dissimilarity_table[paste0(D_s, "_std")]
dissimilarity_table[paste0(D_s, "_norm_median")] <-
(dissimilarity_table[D_s] - dissimilarity_table[paste0(D_s, "_median")]) /
(dissimilarity_table[paste0(D_s, "_percentile_90")] - dissimilarity_table[paste0(D_s, "_median")])
# dplyr::mutate(experiment = i))
return(dissimilarity_table)
}
k_fold_data_prepare <- function(df, min_scroll_len = 3) {
# Given the data, split it by user id:
return(df %>%
dplyr::filter(scroll_length >= min_scroll_len) %>%
dplyr::arrange(time_stamp) %>%
dplyr::ungroup() %>%
split(.$user_id))
}
k_fold_engine <- function(df,
obs,
n_users,
K = 2,
feature_selection,
D_type = "D_cosinus") {
# Train - Test Split:
train_set <- df %>%
dplyr::arrange(time_stamp) %>%
dplyr::filter(obs_id != obs)
if (length(feature_selection) > 1) {
# Manual:
# scaling_param_est <- scale_param_est_total_UG
scaling_param_est <- train_set %>%
dplyr::arrange(time_stamp) %>%
dplyr::group_by(user_id) %>%
dplyr::summarize_at(vars(feature_selection),
funs(mean, sd))
} else if (length(feature_selection) == 1) {
scaling_param_est <- train_set %>%
dplyr::arrange(time_stamp) %>%
dplyr::group_by(user_id) %>%
dplyr::summarize_at(vars(feature_selection), funs(mean, sd)) %>%
dplyr::rename_at(vars("mean", "sd"),
funs(paste(feature_selection, ., sep = "_")))
}
train_set <- train_set %>%
dplyr::arrange(time_stamp) %>%
dplyr::group_by(user_id) %>%
dplyr::mutate_at(vars(feature_selection), scale) %>%
as.data.table() %>%
dplyr::ungroup() %>%
as_tibble()
# Compute std for each train variable:
train_set_std <- train_set %>%
dplyr::left_join(train_set %>%
dplyr::group_by(user_id) %>%
dplyr::summarize_at(feature_selection, sd) %>%
dplyr::rename_at(vars(-"user_id"),
funs(paste0(feature_selection, "_sd"))), by = "user_id") %>%
dplyr::select(matches("_sd"))
test_set <- df %>%
dplyr::filter(obs_id == obs)
test_set_joined_with_scaling_params <- test_set %>%
dplyr::left_join(scaling_param_est, by = "user_id") %>%
dplyr::arrange(time_stamp)
# Manual:
# test_set_joined_with_scaling_params <- cbind(test_set, scaling_param_est)
test_set_joined_with_scaling_params[, feature_selection] <-
(test_set_joined_with_scaling_params[, feature_selection] -
test_set_joined_with_scaling_params[, paste0(feature_selection, "_mean")]) /
test_set_joined_with_scaling_params[, paste0(feature_selection, "_sd")]
test_set <- test_set_joined_with_scaling_params %>%
dplyr::arrange(time_stamp) %>%
dplyr::select(user_id, obs_id, scroll_id,
time_stamp, row_num, scroll_length,
feature_selection)
# Compute std for each train variable:
# compute_std <- train_set %>%
# dplyr::group_by(user_id) %>%
# dplyr::select(-row_num) %>%
# dplyr::rename_at(vars(-user_id, -obs_id, -scroll_id,
# -time_stamp, -scroll_length),
# funs(paste(., "std", sep = "_"))) %>%
# dplyr::summarize_at(vars(matches("_std$")), funs(sd)) %>%
# dplyr::ungroup()
#
# train_set_std <- dplyr::left_join(train_set,
# compute_std,
# by = "user_id") %>%
# dplyr::ungroup() %>%
# dplyr::select(matches("_std$"))
# Compute the dissimilarities:
return(build_dissimilarity_rank(n_users,
train_set,
train_set_std,
test_set,
D_type))
}
k_fold_wrapper <- function(data_df,
K = 2,
D_type_rank = "D_cosinus",
feature_selection) {
data_seqed <- k_fold_data_prepare(data_df)
# Given the data splitted by user id, split it by observation id:
data_seqed_by_obs <- future_imap(data_seqed, ~split(., .$obs_id ))
# Get the observation ids per each splitted sub dataframe:
obs_ids <- future_imap(data_seqed_by_obs, ~as.integer(names(.)))
# Feed kfold engine with splitted data by user id and observations names:
plan(strategy = multicore)
dissimilarity_rank <- furrr::future_map_dfr(data_seqed, function(x) {
furrr::future_map_dfr(obs_ids[[as.character(x$user_id[1])]],
function(df,
obs,
n_users,
K,
feature_selection,
D_type_rank) {
k_fold_engine(df,
obs,
n_users,
K,
feature_selection,
D_type_rank) },
df = x, n_users = x$user_id[1],
K = K, feature_selection = feature_selection,
D_type = D_type_rank) } )
if(nrow(dissimilarity_rank[which(rowSums(is.na(dissimilarity_rank)) > 0), ])) {
dissimilarity_rank <- dissimilarity_rank[which(rowSums(is.na(dissimilarity_rank)) == 0), ] %>%
dplyr::mutate(row_num = row_number())
}
param_estimations <- dissimilarity_rank %>%
build_param_est(K, D_type_rank = D_type_rank)
# Summarize and return final param estimation (average):
# return(param_estimations %>%
# dplyr::group_by(train_user_id) %>%
# summarize_at(vars(-"train_user_id"), mean))
return(list("dissimilarity_rank" = dissimilarity_rank,
"param_estimations" = param_estimations))
}
The final script that causes the issues:
n_users <- max(unique(data$user_id))
train_df <- data %>%
dplyr::group_by(user_id) %>%
dplyr::filter(row_number() <= 50)
filter_users_low_amount_obs <- train_df %>%
dplyr::group_by(user_id) %>%
dplyr::summarise(n_obs = length(unique(obs_id))) %>%
dplyr::arrange(n_obs) %>%
dplyr::filter(n_obs >= 3) %>%
select(user_id)
train_df <- train_df %>%
filter(user_id %in% filter_users_low_amount_obs$user_id)
k_fold_d_rank_param_est <- k_fold_wrapper(train_df, K, D_type_rank = D_type, feature_selection)
dissimilarity_rank_1 <- k_fold_d_rank_param_est$dissimilarity_rank
param_est <- k_fold_d_rank_param_est$param_estimations
train_test_std_split_2 <- train_test_std_split(data,
train_size_2,
test_size = Inf,
feature_selection)
dissimilarity_rank_2 <- build_dissimilarity_rank(n_users,
train_test_std_split_2$train_set,
train_test_std_split_2$train_set_std,
train_test_std_split_2$test_set)

I believe the option you are missing is furrr's scheduling option. By default, your data is split into as many chunks as there are workers at the start of the future_map call, and each worker is assigned one chunk. Once a worker has finished its chunk, it looks for another chunk to work on. If no chunks are left, the worker goes idle.
With the scheduling option you can specify how many chunks per worker your data should be split into. For example, .options = furrr_options(scheduling = 2) creates two chunks per worker, and workers that finish early start working on another chunk.
For more information, here is a vignette on chunking:
https://davisvaughan.github.io/furrr/articles/articles/chunking.html
PS: You have some nested future calls in your code. Depending on the future::plan() you specified, this will only slow down the code.
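As a rough illustration of the scheduling option, here is a minimal sketch that assumes furrr 0.2.0 or later (where furrr_options() is available) and uses a multisession plan with two workers instead of the multicore plan from the question; the squaring function is only a stand-in for the real work.

```r
library(furrr)  # also attaches future, which provides plan()

# Two background R sessions as workers (an assumption for this sketch)
plan(multisession, workers = 2)

# scheduling = 2 asks for two chunks per worker, so a worker that
# finishes its first chunk early can pick up another one.
res <- future_map_dbl(
  1:8,
  function(i) i^2,
  .options = furrr_options(scheduling = 2)
)

plan(sequential)  # shut the workers down again
print(res)
```

Smaller chunks balance uneven workloads better but add scheduling overhead, so the best value depends on how variable the individual jobs are.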

Trying to understand how eval(expr, envir = df) works

I have built a function which seems to work, but I don't understand why.
My initial problem was to take a data.frame which contains counts of a population and expand it to re-create the original population. This is easy enough if you know the column names in advance.
library(tidyverse)
set.seed(121)
test_counts <- tibble(Population = letters[1:4], Length = c(1,1,2,1),
Number = sample(1:100, 4))
expand_counts_v0 <- function(Length, Population, Number) {
tibble(Population = Population,
Length = rep(Length, times = Number))
}
test_counts %>% pmap_dfr(expand_counts_v0) %>% # apply it
group_by(Population, Length) %>% # test it
summarise(Number = n()) %>%
ungroup %>%
{ all.equal(., test_counts)}
# [1] TRUE
However, I wanted to generalise it to a function which didn't need to know the column names of the data.frame, and I'm interested in NSE, so I wrote:
test_counts1 <- tibble(Population = letters[1:4],
Length = c(1,1,2,1),
Number = sample(1:100, 4),
Height = c(100, 50, 45, 90),
Width = c(700, 50, 60, 90)
)
expand_counts_v1 <- function(df, count = NULL) {
countq <- enexpr(count)
names <- df %>% select(-!!countq) %>% names
namesq <- names %>% map(as.name)
cols <- map(namesq, ~ expr(rep(!!., times = !!countq))
) %>% set_names(namesq)
make_tbl <- function(...) {
expr(tibble(!!!cols)) %>% eval(envir = df)
}
df %>% pmap_dfr(make_tbl)
}
But, when I test this function it seems to duplicate rows 4 times:
test_counts %>% expand_counts_v1(count = Number) %>%
group_by(Population, Length) %>%
summarise(Number = n()) %>%
ungroup %>%
{ sum(.$Number)/sum(test_counts$Number)}
# [1] 4
This led me to guess a solution, which was
expand_counts_v2 <- function(df, count = NULL) {
countq <- enexpr(count)
names <- df %>% select(-!!countq) %>% names
namesq <- names %>% map(as.name)
cols <- map(namesq, ~ expr(rep(!!., times = !!countq))
) %>% set_names(namesq)
make_tbl <- function(...) {
expr(tibble(!!!cols)) %>% eval(envir = df)
}
df %>% make_tbl
}
This seems to work:
test_counts %>% expand_counts_v2(count = Number) %>%
group_by(Population, Length) %>%
summarise(Number = n()) %>%
ungroup %>%
{ all.equal(., test_counts)}
# [1] TRUE
test_counts1 %>% expand_counts_v2(count = Number) %>%
group_by(Population, Length, Height, Width) %>%
summarise(Number = n()) %>%
ungroup %>%
{ all.equal(., test_counts1)}
# [1] TRUE
But I don't understand why. How is it evaluating for each row, even though I'm not using pmap anymore? The function needs to be applied to each row in order to work, so it must be somehow, but I can't see how it's doing that.
EDIT
After Artem's correct explanation of what was going on, I realised I could do this
expand_counts_v2 <- function(df, count = NULL) {
countq <- enexpr(count)
names <- df %>% select(-!!countq) %>% names
namesq <- names %>% map(as.name)
cols <- map(namesq, ~ expr(rep(!!., times = !!countq))
) %>% set_names(namesq)
expr(tibble(!!!cols)) %>% eval_tidy(data = df)
}
Which gets rid of the unnecessary make_tbl function. However, as Artem said, that only really works because rep is vectorised. So it works, but not by rewriting the _v0 function and pmapping it, which is the process I was trying to replicate. Eventually I discovered rlang::new_function and wrote:
expand_counts_v3 <- function(df, count = NULL) {
countq <- enexpr(count)
names <- df %>% select(-!!countq) %>% names
namesq <- names %>% map(as.name)
cols <- map(namesq, ~ expr(rep(!!., times = !!countq))
) %>% set_names(namesq)
all_names <- df %>% names %>% map(as.name)
args <- rep(0, times = length(all_names)) %>% as.list %>% set_names(all_names)
correct_function <- new_function(args, # this makes the function as in _v0
expr(tibble(!!!cols)) )
pmap_dfr(df, correct_function) # applies it as in _v0
}
which is longer, and probably uglier, but works the way I originally wanted.

The issue is in eval( envir = df ), which exposes the entire data frame to make_tbl(). Notice that you never use the ... argument inside make_tbl(). Instead, the function effectively computes the equivalent of
with( df, tibble(Population = rep(Population, times = Number),
Length = rep(Length, times=Number)) )
regardless of what arguments you provide to it. When you call the function via pmap_dfr(), it essentially computes the above four times (once for each row) and concatenates the results by-row, resulting in the duplication of entries you've observed. When you remove pmap_dfr(), the function is called once, but since rep is itself vectorized (try doing rep( test_counts$Population, test_counts$Number ) to see what I mean), make_tbl() computes the entire result in one go.
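The effect can be reproduced in base R alone. In this minimal sketch, counts is a made-up two-row data frame: evaluating the expression with envir = counts makes whole columns visible, and the vectorised rep() expands every row in a single call.

```r
# Made-up data: two populations with their counts
counts <- data.frame(Population = c("a", "b"),
                     Number     = c(2, 3),
                     stringsAsFactors = FALSE)

# The names Population and Number resolve to entire columns of counts,
# so rep() repeats each element by the matching count in one go.
expanded <- eval(quote(rep(Population, times = Number)), envir = counts)
print(expanded)

# Equivalent to writing it out explicitly
same <- rep(counts$Population, times = counts$Number)
```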

Combine list of data frames with one column of characters

I am learning to get, clean, and combine data. I am confused about why calling rbind in a loop returns 10 items instead of the expected 30 that I get when I combine the pages manually (one i at a time).
library(XML)
mergeal <- NULL
tabnums <- 3
for (i in 1:length(tabnums)) {
bnn <- paste0("http://www.ngchanmau.com/listing_browse.php?cur_page=",
tabnums[i], "&&coming=22-Oct-2015&coming=22-Oct-2015")
tem <- readHTMLTable(bnn, header=T, stringsAsFactors=F)
#data cleaning
ff <- tem[8] #wanted data
ff1 <- as.data.frame(ff)
ff2 <- ff1[ , 1] #get 1st col data only
ff3 <- unique(ff2)
ff4 <- ff3[c(2,5:13)] #wanted list only
#merging dataset
mergeal <- rbind(mergeal, ff4)
}
I've also tried the approach from "rbind list of data frames with one column of characters and numerics", but I still get the same result as above. I'd appreciate any help with what I missed, thanks.

I cleaned up the data because I was bored.
library(plyr)
library(XML)
library(dplyr)
library(magrittr)
library(stringi)
library(tidyr)
library(lubridate)
answer =
data_frame(tabnums = 1:3) %>%
group_by(tabnums) %>%
do(.$tabnums %>%
paste0("http://www.ngchanmau.com/listing_browse.php?cur_page=",
., "&&coming=22-Oct-2015&coming=22-Oct-2015") %>%
readHTMLTable(header = T, stringsAsFactors = F) %>%
extract2(8)) %>%
ungroup %>%
select(V1) %>%
distinct %>%
mutate(V1 =
V1 %>%
stri_replace_all_fixed("Â", "\n") %>%
stri_replace_all_fixed("Type:", "\nType:") %>%
stri_replace_all_fixed("Time:", "\nTime:") %>%
stri_replace_all_fixed("Area:", "\nArea:") %>%
stri_split_fixed("\n")) %>%
unnest(V1) %>%
mutate(V1 = V1 %>% stri_trim) %>%
filter(V1 %>% stri_detect_regex("^There are currently") %>% `!`) %>%
filter(V1 != "") %>%
separate(V1, c("variable", "value"), sep = ":", fill = "left") %>%
mutate(variable = variable %>% mapvalues(NA, "Description"),
ID = variable %>% `==`("Description") %>% cumsum) %>%
spread(variable, value) %>%
mutate(Area = Area %>% extract_numeric,
Price = Price %>% extract_numeric,
Datetime =
Time %>%
stri_replace_all_fixed("a.m.", "am") %>%
stri_replace_all_fixed("p.m.", "pm") %>%
paste(Date, .) %>%
dmy_hm) %>%
select(-Date, -Time)
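As for why the loop in the question returns only 10 items: tabnums <- 3 is a vector of length 1, so 1:length(tabnums) visits only i = 1 and a single page is read. The sketch below shows the difference without any network access; the page contents are placeholders made up for illustration.

```r
tabnums <- 3

# length() of a single number is 1, so this loop range has one element
loop_range <- 1:length(tabnums)

# seq_len(tabnums) gives 1, 2, 3, i.e. one iteration per page
pages <- lapply(seq_len(tabnums), function(i) {
  # Placeholder for the 10 cleaned entries read from page i
  paste0("page", i, "_item", 1:10)
})

# Combine the per-page character vectors into one vector of 30 entries
merged <- unlist(pages)
print(length(merged))
```

Collecting the pieces in a list and combining them once also avoids growing an object with rbind() on every iteration.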
