how to write result of t.test into a file? - r

How can i write the result of t.test into a file?
> x
[1] 12.2 10.8 12.0 11.8 11.9 12.4 11.3 12.2 12.0 12.3
> t.test(x)
One Sample t-test
data: x
t = 76.2395, df = 9, p-value = 5.814e-14
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
11.5372 12.2428
sample estimates:
mean of x
11.89
> write(t.test(x),file="test")
Error in cat(list(...), file, sep, fill, labels, append) :
argument 1 (type 'list') cannot be handled by 'cat'

> sink("out.txt")
> x <- scan()
1: 12.2 10.8 12.0 11.8 11.9 12.4 11.3 12.2 12.0 12.3
11:
Read 10 items
> t.test(x)
> sink()
> readLines("out.txt")
[1] ""
[2] "\tOne Sample t-test"
[3] ""
[4] "data: x "
[5] "t = 76.2395, df = 9, p-value = 5.814e-14"
[6] "alternative hypothesis: true mean is not equal to 0 "
[7] "95 percent confidence interval:"
[8] " 11.5372 12.2428 "
[9] "sample estimates:"
[10] "mean of x "
[11] " 11.89 "
[12] ""

The broom package was released after this post. It makes dealing with model outputs much, much nicer. In particular, the function tidy() will convert the model output to a dataframe for further handling.
x <- c(12.2, 10.8, 12.0, 11.8, 11.9, 12.4, 11.3, 12.2, 12.0, 12.3)
t_test <- t.test(x)
library(broom)
tidy_t_test <- broom::tidy(t_test)
tidy_t_test
#> # A tibble: 1 x 8
#> estimate statistic p.value parameter conf.low conf.high method
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 11.9 76.2 5.81e-14 9 11.5 12.2 One S…
#> # … with 1 more variable: alternative <chr>
write.csv(tidy_t_test, "out.csv")

Related

Mapping broom::tidy to nested list of {fixest} models and keep name of list element

I want to apply broom::tidy() to models nested in a fixest_multi object and extract the names of each list level as data frame columns. Here's an example of what I mean.
library(fixest)
library(tidyverse)
library(broom)
multiple_est <- feols(c(Ozone, Solar.R) ~ Wind + Temp, airquality, fsplit = ~Month)
This command estimates two models for each dep. var. (Ozone and Solar.R) for a subset of each Month plus the full sample. Here's how the resulting object looks like:
> names(multiple_est)
[1] "Full sample" "5" "6" "7" "8" "9"
> names(multiple_est$`Full sample`)
[1] "Ozone" "Solar.R"
I now want to tidy each model object, but keep the information of the Month / Dep.var. combination as columns in the tidied data frame. My desired output would look something like this:
I can run map_dfr from the tidyr package, giving me this result:
> map_dfr(multiple_est, tidy, .id ="Month") %>% head(9)
# A tibble: 9 x 6
Month term estimate std.error statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Full sample (Intercept) -71.0 23.6 -3.01 3.20e- 3
2 Full sample Wind -3.06 0.663 -4.61 1.08e- 5
3 Full sample Temp 1.84 0.250 7.36 3.15e-11
4 5 (Intercept) -76.4 82.0 -0.931 3.53e- 1
5 5 Wind 2.21 2.31 0.958 3.40e- 1
6 5 Temp 3.07 0.878 3.50 6.15e- 4
7 6 (Intercept) -70.6 46.8 -1.51 1.45e- 1
8 6 Wind -1.34 1.13 -1.18 2.50e- 1
9 6 Temp 1.64 0.609 2.70 1.29e- 2
But this tidies only the first model of each Month, the model with the Ozone outcome.
My desired output would look something like this:
Month outcome term estimate more columns from tidy
Full sample Ozone (Intercept) -71.0
Full sample Ozone Wind -3.06
Full sample Ozone Temp 1.84
Full sample Solar.R (Intercept) some value
Full sample Solar.R Wind some value
Full sample Solar.R Temp some value
... rows repeated for each month 5, 6, 7, 8, 9
How can I apply tidy to all models and add another column that indicates the outcome of the model (which is stored in the name of the model object)?
So, fixest_mult has a pretty strange setup as I delved deeper. As you noticed, mapping across it or using apply just accesses part of the data frames. In fact, it isn't just the data frames for "Ozone", but actually just the data frames for the first 6 data frames (those for c("Full sample", "5", "6").
If you convert to a list, it access the data attribute, which is a sequential list of all 12 data frames, but dropping the relevant names you're looking for. So, as a workaround, could use pmap() and the names (found in the attributes of the object) to tidy() and then use mutate() for your desired columns.
library(fixest)
library(tidyverse)
library(broom)
multiple_est <- feols(c(Ozone, Solar.R) ~ Wind + Temp, airquality, fsplit = ~Month)
nms <- attr(multiple_est, "meta")$all_names
pmap_dfr(
list(
data = as.list(multiple_est),
month = rep(nms$sample, each = length(nms$lhs)),
outcome = rep(nms$lhs, length(nms$sample))
),
~ tidy(..1) %>%
mutate(
Month = ..2,
outcome = ..3,
.before = 1
)
)
#> # A tibble: 36 × 7
#> Month outcome term estimate std.error statistic p.value
#> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Full sample Ozone (Intercept) -71.0 23.6 -3.01 3.20e- 3
#> 2 Full sample Ozone Wind -3.06 0.663 -4.61 1.08e- 5
#> 3 Full sample Ozone Temp 1.84 0.250 7.36 3.15e-11
#> 4 Full sample Solar.R (Intercept) -76.4 82.0 -0.931 3.53e- 1
#> 5 Full sample Solar.R Wind 2.21 2.31 0.958 3.40e- 1
#> 6 Full sample Solar.R Temp 3.07 0.878 3.50 6.15e- 4
#> 7 5 Ozone (Intercept) -70.6 46.8 -1.51 1.45e- 1
#> 8 5 Ozone Wind -1.34 1.13 -1.18 2.50e- 1
#> 9 5 Ozone Temp 1.64 0.609 2.70 1.29e- 2
#> 10 5 Solar.R (Intercept) -284. 262. -1.08 2.89e- 1
#> # … with 26 more rows

Improve parallel performance with batching in a static-dynamic branching pipeline

BLUF: I am struggling to understand out how to use batching in the R targets package to improve performance in a static and dynamic branching pipeline processed in parallel using tar_make_future(). I presume that I need to batch within each dynamic branch but I am unsure how to go about doing that.
Here's a reprex that uses dynamic branching nested inside static branching, similar to what my actual pipeline is doing. It first branches statically for each value in all_types, and then dynamically branches within each category. This code produces 1,000 branches and 1,010 targets total. In the actual workflow I obviously don't use replicate, and the dynamic branches vary in number depending on the type value.
# _targets.r
library(targets)
library(tarchetypes)
library(future)
library(future.callr)
plan(callr)
all_types = data.frame(type = LETTERS[1:10])
tar_map(values = all_types, names = "type",
tar_target(
make_data,
replicate(100,
data.frame(x = seq(1000) + rnorm(1000, 0, 5),
y = seq(1000) + rnorm(1000, 20, 20)),
simplify = FALSE
),
iteration = "list"
),
tar_target(
fit_model,
lm(make_data),
pattern = map(make_data),
iteration = "list"
)
)
And here's a timing comparison of tar_make() vs tar_make_future() with eight workers:
# tar_destroy()
t1 <- system.time(tar_make())
# tar_destroy()
t2 <- system.time(tar_make_future(workers = 8))
rbind(serial = t1, parallel = t2)
## user.self sys.self elapsed user.child sys.child
## serial 2.12 0.11 25.59 NA NA
## parallel 2.07 0.24 184.68 NA NA
I don't think the user or system fields are useful here since the job gets dispatched to separate R processes, but the elapsed time for the parallel job takes about 7 times longer than the serial job.
I presume this slowdown is caused by the large number of targets. Will batching improve performance in this case, and if so how can I implement batching within the dynamic branch?
You are on the right track with batching. In your case, that is a matter of breaking up your list of 100 datasets into groups of, say, 10 or so. You could do this with a nested list of datasets, but that's a lot of work. Luckily, there is an easier way.
Your question is actually really well-timed. I just wrote some new target factories in tarchetypes that could help. To access them, you will need the development version of tarchetypes from GitHub:
remotes::install_github("ropensci/tarchetypes")
Then, with tar_map2_count(), it will be much easier to batch your list of 100 datasets for each scenario.
library(targets)
tar_script({
library(broom)
library(targets)
library(tarchetypes)
library(tibble)
make_data <- function(n) {
datasets_per_batch <- replicate(
100,
tibble(
x = seq(n) + rnorm(n, 0, 5),
y = seq(n) + rnorm(n, 20, 20)
),
simplify = FALSE
)
tibble(dataset = datasets_per_batch, rep = seq_along(datasets_per_batch))
}
tar_map2_count(
name = model,
command1 = make_data(n = rows),
command2 = tidy(lm(y ~ x, data = dataset)), # Need dataset[[1]] in tarchetypes 0.4.0
values = data_frame(
scenario = LETTERS[seq_len(10)],
rows = seq(10, 100, length.out = 10)
),
columns2 = NULL,
batches = 10
)
})
tar_make(reporter = "silent")
#> Warning message:
#> `data_frame()` was deprecated in tibble 1.1.0.
#> Please use `tibble()` instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
tar_read(model)
#> # A tibble: 2,000 × 8
#> term estimate std.error statistic p.value scenario rows tar_group
#> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <int>
#> 1 (Intercept) 17.1 12.8 1.34 0.218 A 10 10
#> 2 x 1.39 1.35 1.03 0.333 A 10 10
#> 3 (Intercept) 6.42 14.0 0.459 0.658 A 10 10
#> 4 x 1.75 1.28 1.37 0.209 A 10 10
#> 5 (Intercept) 32.8 7.14 4.60 0.00176 A 10 10
#> 6 x -0.300 1.14 -0.263 0.799 A 10 10
#> 7 (Intercept) 29.7 3.24 9.18 0.0000160 A 10 10
#> 8 x 0.314 0.414 0.758 0.470 A 10 10
#> 9 (Intercept) 20.0 13.6 1.47 0.179 A 10 10
#> 10 x 1.23 1.77 0.698 0.505 A 10 10
#> # … with 1,990 more rows
Created on 2021-12-10 by the reprex package (v2.0.1)
There is also tar_map_rep(), which may be easier if all your datasets are randomly generated, but I am not sure if I am overfitting your use case.
library(targets)
tar_script({
library(broom)
library(targets)
library(tarchetypes)
library(tibble)
make_one_dataset <- function(n) {
tibble(
x = seq(n) + rnorm(n, 0, 5),
y = seq(n) + rnorm(n, 20, 20)
)
}
tar_map_rep(
name = model,
command = tidy(lm(y ~ x, data = make_one_dataset(n = rows))),
values = data_frame(
scenario = LETTERS[seq_len(10)],
rows = seq(10, 100, length.out = 10)
),
batches = 10,
reps = 10
)
})
tar_make(reporter = "silent")
#> Warning message:
#> `data_frame()` was deprecated in tibble 1.1.0.
#> Please use `tibble()` instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
tar_read(model)
#> # A tibble: 2,000 × 10
#> term estimate std.error statistic p.value scenario rows tar_batch tar_rep
#> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <int> <int>
#> 1 (Inter… 37.5 7.50 5.00 0.00105 A 10 1 1
#> 2 x -0.701 1.17 -0.601 0.564 A 10 1 1
#> 3 (Inter… 21.5 9.64 2.23 0.0567 A 10 1 2
#> 4 x -0.213 1.55 -0.138 0.894 A 10 1 2
#> 5 (Inter… 20.6 9.51 2.17 0.0620 A 10 1 3
#> 6 x 1.40 1.79 0.783 0.456 A 10 1 3
#> 7 (Inter… 11.6 11.2 1.04 0.329 A 10 1 4
#> 8 x 2.34 1.39 1.68 0.131 A 10 1 4
#> 9 (Inter… 26.8 9.16 2.93 0.0191 A 10 1 5
#> 10 x 0.288 1.10 0.262 0.800 A 10 1 5
#> # … with 1,990 more rows, and 1 more variable: tar_group <int>
Created on 2021-12-10 by the reprex package (v2.0.1)
Unfortunately, futures do come with overhead. Maybe it will be faster in your case if you try tar_make_clustermq()?

How to index the features() function to iterate over a list of data frames using map() function in R?

Plotting my soil compaction data gives a convex-up curve. I need to determine the maximum y-value and the x-value which produces that maximum.
The 'features' package fits a smooth spline to the data and returns the features of the spline, including the y-maximum and critical x-value. I am having difficulty iterating the features() function over multiple samples, which are contained in a tidy list.
It seems that the features package is having trouble indexing to the data. The code works fine when I use data for only one sample, but when I try to use the dot placeholder and square brackets it loses track of the data.
Below is the code showing how this process works correctly for one sample, but not for an iteration.
#load packages
library(tidyverse)
#> Warning: package 'ggplot2' was built under R version 3.6.3
#> Warning: package 'forcats' was built under R version 3.6.3
library(features)
#> Warning: package 'features' was built under R version 3.6.3
#> Loading required package: lokern
#> Warning: package 'lokern' was built under R version 3.6.3
# generate example data
df <- tibble(
sample = (rep(LETTERS[1:3], each=4)),
w = c(seq(0.08, 0.12, by=0.0125),
seq(0.09, 0.13, by=0.0125),
seq(0.10, 0.14, by=0.0125)),
d= c(1.86, 1.88, 1.88, 1.87,
1.90, 1.92, 1.92, 1.91,
1.96, 1.98, 1.98, 1.97) )
df
#> # A tibble: 12 x 3
#> sample w d
#> <chr> <dbl> <dbl>
#> 1 A 0.08 1.86
#> 2 A 0.0925 1.88
#> 3 A 0.105 1.88
#> 4 A 0.118 1.87
#> 5 B 0.09 1.9
#> 6 B 0.102 1.92
#> 7 B 0.115 1.92
#> 8 B 0.128 1.91
#> 9 C 0.1 1.96
#> 10 C 0.112 1.98
#> 11 C 0.125 1.98
#> 12 C 0.138 1.97
# use the 'features' package to fit a smooth spline and extract the spline features,
# including local y-maximum and critical point along x-axis.
# This works fine for one sample at a time:
sample1_data <- df %>% filter(sample == 'A')
sample1_features <- features(x= sample1_data$w,
y= sample1_data$d,
smoother = "smooth.spline")
sample1_features
#> $f
#> fmean fmin fmax fsd noise
#> 1.880000e+00 1.860000e+00 1.880000e+00 1.000000e-02 0.000000e+00
#> snr d1min d1max fwiggle ncpts
#> 2.707108e+11 -9.100000e-01 1.970000e+00 9.349000e+01 1.000000e+00
#>
#> $cpts
#> [1] 0.1
#>
#> $curvature
#> [1] -121.03
#>
#> $outliers
#> [1] NA
#>
#> attr(,"fits")
#> attr(,"fits")$x
#> [1] 0.0800 0.0925 0.1050 0.1175
#>
#> attr(,"fits")$y
#> [1] 1.86 1.88 1.88 1.87
#>
#> attr(,"fits")$fn
#> [1] 1.86 1.88 1.88 1.87
#>
#> attr(,"fits")$d1
#> [1] 1.9732965 0.8533784 -0.5868100 -0.9061384
#>
#> attr(,"fits")$d2
#> [1] 4.588832e-03 -1.791915e+02 -5.123866e+01 1.461069e-01
#>
#> attr(,"class")
#> [1] "features"
# But when attempting to use the pipe and the map() function
# to iterate over a list containing data for multiple samples,
# using the typical map() placeholder dot will not index to the
# list element/columns that are being passed to .f
df_split <- split(df, f= df[['sample']])
df_split
#> $A
#> # A tibble: 4 x 3
#> sample w d
#> <chr> <dbl> <dbl>
#> 1 A 0.08 1.86
#> 2 A 0.0925 1.88
#> 3 A 0.105 1.88
#> 4 A 0.118 1.87
#>
#> $B
#> # A tibble: 4 x 3
#> sample w d
#> <chr> <dbl> <dbl>
#> 1 B 0.09 1.9
#> 2 B 0.102 1.92
#> 3 B 0.115 1.92
#> 4 B 0.128 1.91
#>
#> $C
#> # A tibble: 4 x 3
#> sample w d
#> <chr> <dbl> <dbl>
#> 1 C 0.1 1.96
#> 2 C 0.112 1.98
#> 3 C 0.125 1.98
#> 4 C 0.138 1.97
df_split %>% map(.f = features, x = .[['w']], y= .[['d']], smoother = "smooth.spline")
#> Warning in min(x): no non-missing arguments to min; returning Inf
#> Warning in max(x): no non-missing arguments to max; returning -Inf
#> Error in seq.default(min(x), max(x), length = max(npts, length(x))): 'from' must be a finite number
Created on 2020-04-04 by the reprex package (v0.3.0)
You could use group_split to split the data based on sample and use map to apply features functions to each subset of data.
library(features)
library(dplyr)
library(purrr)
list_model <- df %>%
group_split(sample) %>%
map(~features(x = .x$w, y = .x$d, smoother = "smooth.spline"))

How to run a paired t-test on different levels of a categorical variable?

I am trying to run a paired t-test on pre- and post-intervention results of three intervention types. I am trying to run the the test on each intervention separately using "subset" in t.test function but it keeps running the test on the whole sample. I cannot separate the intervention levels manually as this is a large database and I do not have access to the excel file. Does anyone have any suggestions?
Here's the codes I am using:
Treatment (intervention) levels:"Passive" "Pro" "Peer"
"Post" and "Pre" are continuous variables.
t.test(data$Post, data$Pre, paired=T, subset=data$Treatment=="Peer")
t.test(data$Post, data$Pre, paired=T, subset=data$Treatment=="Pro")
t.test(data$Post, data$Pre, paired=T, subset=data$Treatment=="Passive")
There is no subset argument (nor a data argument) for the t.test function when using the default method:
> args(stats:::t.test.default)
function (x, y = NULL, alternative = c("two.sided", "less",
"greater"), mu = 0, paired = FALSE, var.equal = FALSE,
conf.level = 0.95, ...)
You'll have to subset first,
with(subset(data, subset=Treatment=="Peer"),
t.test(Post, Pre, paired=TRUE)
)
There's also an easier way using dplyr and broom...
library(dplyr)
library(broom)
data %>%
group_by(Treatment) %>%
do(tidy(t.test(.$Pre, .$Post, paired=TRUE)))
Reproducible example:
set.seed(123)
data <- tibble(id=1:63, Pre=rnorm(21*3,10,5), Post=rnorm(21*3,13,5),
Treatment=sample(c("Peer","Pro","Passive"), 63, TRUE))
data
# A tibble: 63 x 4
id Pre Post Treatment
<int> <dbl> <dbl> <chr>
1 1 7.20 7.91 Pro
2 2 8.85 7.64 Peer
3 3 17.8 14.5 Peer
4 4 10.4 15.2 Peer
5 5 10.6 13.3 Passive
6 6 18.6 17.6 Passive
7 7 12.3 23.3 Pro
8 8 3.67 10.5 Peer
9 9 6.57 1.45 Pro
10 10 7.77 18.0 Passive
# ... with 53 more rows
Output:
# A tibble: 3 x 9
# Groups: Treatment [3]
Treatment estimate statistic p.value parameter conf.low conf.high method alternative
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 Passive -2.41 -1.72 0.107 14 -5.42 0.592 Paired t-~ two.sided
2 Peer -3.61 -2.96 0.00636 27 -6.11 -1.10 Paired t-~ two.sided
3 Pro -1.22 -0.907 0.376 19 -4.03 1.59 Paired t-~ two.sided

skimr: how to remove histogram?

I want to use the function skim from R package skimr on Windows. Unfortunately, in many situations column, hist is printed incorrectly (with many <U+2587>-like symbols), as in the example below.
Question: is there an easy way to either disable column "hist" and prevent it from being printed or prevent it from being calculated at all? Is there an option like hist = FALSE?
capture.output(skimr::skim(iris))
#> [1] "Skim summary statistics"
#> [2] " n obs: 150 "
#> [3] " n variables: 5 "
#> [4] ""
#> [5] "-- Variable type:factor ------------------------------------------------------------------------"
#> [6] " variable missing complete n n_unique top_counts"
#> [7] " Species 0 150 150 3 set: 50, ver: 50, vir: 50, NA: 0"
#> [8] " ordered"
#> [9] " FALSE"
#> [10] ""
#> [11] "-- Variable type:numeric -----------------------------------------------------------------------"
#> [12] " variable missing complete n mean sd p0 p25 p50 p75 p100"
#> [13] " Petal.Length 0 150 150 3.76 1.77 1 1.6 4.35 5.1 6.9"
#> [14] " Petal.Width 0 150 150 1.2 0.76 0.1 0.3 1.3 1.8 2.5"
#> [15] " Sepal.Length 0 150 150 5.84 0.83 4.3 5.1 5.8 6.4 7.9"
#> [16] " Sepal.Width 0 150 150 3.06 0.44 2 2.8 3 3.3 4.4"
#> [17] " hist"
#> [18] " <U+2587><U+2581><U+2581><U+2582><U+2585><U+2585><U+2583><U+2581>"
#> [19] " <U+2587><U+2581><U+2581><U+2585><U+2583><U+2583><U+2582><U+2582>"
#> [20] " <U+2582><U+2587><U+2585><U+2587><U+2586><U+2585><U+2582><U+2582>"
#> [21] " <U+2581><U+2582><U+2585><U+2587><U+2583><U+2582><U+2581><U+2581>"
Changing the locale to Chinese (as in this answer) does not solve the problem, but makes it worse:
Sys.setlocale(locale = "Lithuanian")
df <- data.frame(x = 1:5, y = c("Ą", "Č", "Ę", "ū", "ž"))
Sys.setlocale(locale = "Chinese")
capture.output(skimr::skim(df))
#> Error in substr(names(x), 1, options$formats$.levels$max_char) : invalid multibyte string at '<c0>'
skim_with(numeric = list(hist = NULL)) This is in the "Using Skimr" vignette.
You could also use skim_without_charts instead of skim.
More details in the docs here:
https://www.rdocumentation.org/packages/skimr/versions/2.0.2/topics/skim
Also keep in mind that the output from skimr is a dataframe so you can do:
# I'm using tidyverse here
iris %>%
skim() %>%
select(-numeric.hist)
The catch is that the name of the column is not hist but numeric.hist.
I actually got to this question because I wanted to do the opposite: keep only the histograms.

Resources