Can you pipe data into a pairwise.t.test? - r

I'm wondering if the following code can be simplified to allow the data to be piped directly from the summarise command to the pairwise.t.test, without creating the intermediary object?
data_for_PTT <- data %>%
group_by(subj, TT) %>%
summarise(meanRT = mean(RT))
pairwise.t.test(x = data_for_PTT$meanRT, g = data_for_PTT$TT, paired = TRUE)
I tried x = .$meanRT but it didn't like it, returning:
Error in match.arg(p.adjust.method) :
'arg' must be NULL or a character vector

You can use curly braces:
data_for_PTT <- data %>%
group_by(subj, TT) %>%
summarise(meanRT = mean(RT)) %>%
{pairwise.t.test(x = .$meanRT, g = .$TT, paired = TRUE)}
Reproducible:
df <- data.frame(X1 = runif(1000), X2 = runif(1000), subj = rep(c("A", "B")))
df %>%
{pairwise.t.test(.$X1, .$subj, paired = TRUE)}
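As for why x = .$meanRT on its own failed: when the dot only appears inside a nested expression such as .$meanRT, magrittr still inserts the left-hand side as the first argument. With x, g and paired already named, the data frame then lands in the next free parameter, p.adjust.method, which would explain the match.arg error. Below is a minimal sketch with made-up data (this df and its values are hypothetical), showing the braces next to with() as an alternative that needs no dot at all:

```r
library(magrittr)

set.seed(42)
# Hypothetical data: 20 subjects, each measured under three trial types.
df <- data.frame(
  subj   = rep(1:20, times = 3),
  TT     = rep(c("a", "b", "c"), each = 20),
  meanRT = rnorm(60, mean = 500, sd = 50)
)

# Braces stop the pipe from also inserting df as the first argument.
res_braces <- df %>% {pairwise.t.test(x = .$meanRT, g = .$TT, paired = TRUE)}

# with() evaluates the call inside the data frame, so columns are used by name.
res_with <- df %>% with(pairwise.t.test(x = meanRT, g = TT, paired = TRUE))
```

Both calls should return the same matrix of adjusted p-values; which one reads better is mostly a matter of taste.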

Related

Vectorization to extract and bind very nested data

I have some very nested data. Within my list-column-dataframes, there are some pieces I need to put together and I've done so in a single instance to get my desired dataframe:
a <- df[[2]][["result"]]@data
b <- df[[2]][["result"]]@coords
desired_df <- cbind(a, b)
My original Large list has 171 elements, meaning I have 1:171 (3.3 GB) to go inside those square brackets and would ideally end up with 171 desired dataframes (which I would then bind all together).
I haven't needed to write a loop in 10 years, but I don't see a tidyverse way to deal with this. I also no longer know how to write loops. There are definitely some elements in there that are junk and will fail.
You haven't provided any sort of minimal example of the data.
I've condensed it to mean something like this
base_data <- data.frame(group = c("a", "b", "c"), var1 = c(3, 1, 2),
var2 = c( 2, 4, 8))
base_data2 = matrix(
c(1, 2, 3, 4, 5, 6, 7, 8, 9),
nrow = 3,
ncol = 3,
byrow = TRUE
)
rownames(base_data2) = c("d", "e", "f")
methods::setClass(
"weird_object",
slots = c(data = "data.frame", coords = "matrix"),
prototype = list(data = base_data, coords = base_data2)
)
df <- list(
list(
result = new("weird_object")
),list(
result = new("weird_object")
),list(
result = new("weird_object")
),list(
result = new("weird_object")
)
)
And if I had such a list with these objects, then I could do
df %>%
map(. %>% {
list(data = .$result@data,
coords = .$result@coords)
}) %>%
enframe() %>%
unnest_wider(value)
But the selecting / hoisting function might fail, thus
one can wrap it in a purrr::possibly, and
choose a reasonable default:
df %>%
map(possibly(. %>% {
list(data = .$result@data,
coords = .$result@coords)
},
otherwise = list(data = NA, coords = NA))) %>%
enframe() %>%
unnest_wider(value)
Hopefully, this could be a step forward.
Next step is probably something resembling this:
df %>%
map(. %>% {
list(data = .$result@data,
coords = .$result@coords)
}) %>%
enframe() %>%
unnest_wider(value) %>%
mutate(coords = coords %>% map(. %>% as_tibble(rownames = "rowid"))) %>%
unnest(cols = c(data, coords)) %>%
#' rotating the thing now
pivot_longer(cols = c(group, rowid),
names_to = "var_name",
values_to = "var") %>%
select(-var_name) %>%
pivot_longer(cols = c(var1, var2, V1, V2, V3),
names_to = "var_name") %>%
pivot_wider(names_from = var, values_from = value) %>%
identity()
If I understand your data structure, which I probably don't, you could do:
library(tidyverse)
# Create dummy data
df <- mtcars
df$mpg <- list(result = I(list('test')))
df$mpg$result <- list("@data" = I(list('your data')))
df <- df %>% select(mpg, cyl)
df1 <- df
df2 <- df
# Pull data you're interested in.
# The index is 1 here, instead of 2, because it's fake data and not your data.
# Assuming the @ is not unique, and is just parsed from JSON or some other format.
dont_at_me <- function(x){
a <- x[[1]][["result"]][["@data"]]
a
}
# Get a list of all of your data.frames
all_dfs <- Filter(function(x) is(x, "data.frame"), mget(ls()))
# Vectorize
purrr::map(all_dfs, ~dont_at_me(.))

Error in is.data.frame(.l) : object 'group' not found

Not sure if you all will be able to help me without reproducible example data, but I have a problem with running the code below. I am attempting to use the multidplyr package, but it doesn't seem to find my columns. I am running the code below:
cl <- detectCores()
cl
models_prep <-
bookings_prep %>%
inner_join(pipeline_prep_, by = c("booking_type", "group")) %>%
crossing(biz_day) %>%
left_join(closed_pipeline, by = c("booking_type", "group")) %>%
select(-opportunity_forecast_category)
group1 <- rep(1:cl, length.out = nrow(models_prep))
models_prep1 <- bind_cols(tibble(group1), models_prep)
cluster <- new_cluster(cl)
cluster %>%
cluster_library("tidyr")
cluster %>%
cluster_library("purrr")
cluster %>%
cluster_library("plyr")
cluster %>%
cluster_library("dplyr")
cluster_copy(cluster, "rmf")
cluster_copy(cluster, "fc_xreg")
#cluster_assign(cluster, "rmf")
#cluster_copy(cluster,c("rmf","fc_xreg"))
by_group <- models_prep %>%
group_by(group) %>%
partition(cluster)
by_group1 <- models_prep1 %>%
group_by(group1) %>%
partition(cluster)
models <- by_group %>%
mutate(
xreg_arima = pmap(list(data = pipeline, h = 1,name = group, bookings = bookings, type = booking_type,
biz_day = biz_day, no_bookings = no_bookings,
sparse_pipeline = sparse_pipeline,
closed_forecast_cat = pipeline_amount, FUN = "fc_xreg"), rmf))
Everything runs up to models <- correctly, but it fails there saying it cannot find the object group. Here is what the by_group data frame looks like.
Sometimes arguments just need to be quoted, particularly in dplyr-ish situations.
models <- by_group %>%
mutate(
xreg_arima = pmap(list(data = pipeline, h = 1,name = "group", bookings = "bookings", type = "booking_type",
biz_day = "biz_day", no_bookings = "no_bookings",
sparse_pipeline = "sparse_pipeline",
closed_forecast_cat = "pipeline_amount", FUN = "fc_xreg"), rmf))

STRING_ELT() can only be applied to a 'character vector', not a 'integer'

I have two data frames, a and b, and want to compare certain columns between them. Everything worked just fine until this error came up:
Error in mutate_impl(.data, dots) :
Evaluation error: STRING_ELT() can only be applied to a 'character vector', not a 'integer'.
my code:
library(RecordLinkage)
library(dplyr)
lookup <- expand.grid(target = a$NAME, source = b$WHOLE_NAME, stringsAsFactors = FALSE)
y <-lookup %>% group_by(target) %>%
mutate(match_score = jarowinkler(target, source)) %>%
summarise(match = match_score[which.max(match_score)], matched_to =
source[which.max(match_score)]) %>%
inner_join(b, by = c("matched_to" = "WHOLE_NAME"))
Without example data it is difficult to know for sure, but I can reproduce the error when the column with the names in b and/or a is a factor.
One solution is to use the stringdist function from the package stringdist:
a <- data.frame(names = c("foo", "bar", "aargh"), stringsAsFactors = FALSE)
b <- data.frame(wholename= c("foob", "baar", "flierp"), stringsAsFactors = FALSE)
lookup <- expand.grid(target = a$names, source = b$wholename, stringsAsFactors = FALSE)
y <-lookup %>% group_by(target) %>%
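# stringdist() returns a distance (0 = identical), so turn it into a similarity before which.max()
mutate(match_score = 1 - stringdist::stringdist(target, source, method = "jw")) %>%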
summarise(match = match_score[which.max(match_score)], matched_to =
source[which.max(match_score)]) %>%
inner_join(b, by = c("matched_to" = "wholename"))
Another solution is to use the reclin package (of which I am the author):
library(reclin)
names(b) <- "names"
pair_blocking(a, b) %>%
compare_pairs(by = c("names"), default_comparator = jaro_winkler()) %>%
select_n_to_m(weight = "names") %>%
link()
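If factor columns are indeed the cause, converting them to character before building lookup may be all that is needed. The sketch below uses hypothetical stand-ins for a and b and only shows the conversion step. Whether jarowinkler() then runs cleanly on the real data is not tested here.

```r
# Hypothetical stand-ins for a and b, with the name columns stored as factors.
a <- data.frame(NAME = c("foo", "bar", "aargh"), stringsAsFactors = TRUE)
b <- data.frame(WHOLE_NAME = c("foob", "baar", "flierp"), stringsAsFactors = TRUE)

# expand.grid(stringsAsFactors = FALSE) leaves inputs that are already factors untouched.
lookup_factor <- expand.grid(target = a$NAME, source = b$WHOLE_NAME,
                             stringsAsFactors = FALSE)

# Converting up front gives plain character columns.
lookup <- expand.grid(target = as.character(a$NAME),
                      source = as.character(b$WHOLE_NAME),
                      stringsAsFactors = FALSE)
```

Checking is.factor(lookup_factor$target) against is.character(lookup$target) is a quick way to see which case applies to your own data.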

How to get formulas of multiple regressions by vectorizing

Suppose I have the following code that makes multiple regressions and stores the lm and lm with stepwise selection models in tibbles:
library(dplyr)
library(tibble)
library(MASS)
set.seed(1)
df <- data.frame(A = sample(3, 10, replace = T),
B = sample(100, 10, replace = T),
C = sample(100, 10, replace = T))
df <- df %>% arrange(A)
formula_df <- as.tibble(NA)
aic_df <- as.tibble(NA)
for (i in unique(df$A)){
temp <- df %>% filter(A == i)
formula_df[i, 1] <- temp %>%
do(model = lm(B ~ C, data = .))
aic_df[i, 1] <- temp %>%
do(model = stepAIC(formula_df[[1,1]], direction = "both", trace = F))
}
Is it possible to vectorize to make it faster, for example using the *pply functions? The loop becomes extremely slow when the data gets larger. Thank you in advance.
You could try something like:
model <- df %>% group_by(A) %>%
summarise(formula_model = list(lm(B ~ C))) %>%
mutate(aic_model = list(stepAIC(.[[1,2]], direction = "both", trace = F)))
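One caveat with the snippet above: .[[1,2]] always picks the model in the first row, so every group ends up with the stepwise result for group 1. A base R sketch that fits and steps each group separately could look like the following. It assumes split() plus lapply() is acceptable in place of dplyr, and fit_group is a helper name made up for this example. Passing data = d to lm() is what lets stepAIC() refit the model inside the helper.

```r
library(MASS)

set.seed(1)
# Hypothetical data: three groups of ten rows each.
df <- data.frame(A = rep(1:3, each = 10),
                 B = sample(100, 30, replace = TRUE),
                 C = sample(100, 30, replace = TRUE))

# Fit the full model and its stepwise-selected version for one group.
fit_group <- function(d) {
  full <- lm(B ~ C, data = d)
  stepped <- stepAIC(full, direction = "both", trace = FALSE)
  list(full = full, stepped = stepped)
}

# One list element per level of A.
models <- lapply(split(df, df$A), fit_group)

# Formula kept by each stepwise model, as character strings.
step_formulas <- vapply(models,
                        function(m) deparse(formula(m$stepped)),
                        character(1))
```

This is not vectorised in the strict sense, since one model is still fitted per group, but it avoids growing tibbles row by row inside a for loop.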

Difference between indexing with $ and [[]] ???

I have a question regarding indexing a data frame in R. This is the code:
Gewicht <- data %>%
group_by(data[[376]]) %>%
summarise(weights = mean(data[[10190]], na.rm = TRUE))
Gewicht2 <- data %>%
group_by(data[[376]]) %>%
summarise(weights = mean(Weights, na.rm = TRUE))
a <- seq(1:10)
b <- rep(c("male", "female"),5)
c <- seq(1:10)
data <- as.data.frame(cbind(a,b,c))
data$c <- as.numeric(data$c)
newdata <- data %>%
group_by(data[[2]]) %>%
summarise(Mean = mean(c, na.rm = TRUE))
newdata2 <- data %>%
group_by(data[[2]]) %>%
summarise(Mean = mean(data[[3]], na.rm = TRUE))
print(newdata)
print(newdata2)
I get different results for the two data frames. The desired result is the one in "newdata". Can you tell me why I get different values for these two calculations?
I need brackets for a more complex custom function, but it seems to compute the mean over the whole data frame, where I would hope to get the mean for each group.
How do I use [] or [[]] correctly here?
a <- c(1,2,3,4,5,6,7,8,9,10)
b <- rep(c("male", "female"),5)
c <- c(1,2,3,4,5,6,7,8,9,10)
data <- data.frame(cbind(a,b,c))
data$c <- as.numeric(as.character(data$c))
c
data$c
print(newdata)
print(newdata2)
newdata <- data %>%
group_by(data[[2]]) %>%
summarise(Mean = mean(c, na.rm = TRUE))
newdata2 <- data %>%
group_by(data[[2]]) %>%
summarise(Mean = mean(data[[3]], na.rm = TRUE))
newdata
newdata2
updated code, still different results :(
Gewicht <- aggregate(data[[varGewicht]], by=list(data[[varx]]), FUN=mean, na.rm = TRUE)
Aggregate function works :-)
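On the why: inside summarise(), a bare column name such as c is evaluated per group, whereas data[[3]] reaches back to the original, ungrouped data frame and returns the whole column every time. If the columns have to be picked by position or by a name held in a variable, one option is dplyr's .data pronoun together with across(all_of(...)). This is a minimal sketch, assuming dplyr 1.0 or later; group_col and value_col are names made up for the example.

```r
library(dplyr)

# Build the data frame directly so that c stays numeric (no cbind() coercion).
data <- data.frame(a = 1:10,
                   b = rep(c("male", "female"), 5),
                   c = as.numeric(1:10))

# Look the column names up by position once, outside the pipeline.
group_col <- names(data)[2]
value_col <- names(data)[3]

newdata <- data %>%
  group_by(across(all_of(group_col))) %>%
  # .data[[...]] refers to the current group's slice, not the whole data frame.
  summarise(Mean = mean(.data[[value_col]], na.rm = TRUE))
```

With this toy data the two groups get different means, which is the per-group behaviour the aggregate() call also gives.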
