I have a custom ordering of classes from min to max: "Class_0_1", "Class_1_3", "Class_3_9", "Class_9_25", "Class_25_50", "Class_50".
library(dplyr)
my.ds <- read.csv("https://raw.githubusercontent.com/Leprechault/trash/main/test_ants.csv")
my.ds$ClassType <- cut(my.ds$AT,breaks=c(-Inf,1,2.9,8.9,24.9,49.9,Inf),
right=FALSE,labels=c("Class_0_1","Class_1_3","Class_3_9",
"Class_9_25","Class_25_50","Class_50"))
my.ds%>% group_by(nest,ClassType)%>% summarize(avg=mean(AT))
# A tibble: 14 x 3
# Groups: nest [7]
nest ClassType avg
<int> <fct> <dbl>
1 2 Class_9_25 19.0
2 3 Class_0_1 0.776
3 3 Class_9_25 12.4
4 3 Class_25_50 29.4
5 4 Class_1_3 2.42
6 4 Class_9_25 17.0
7 7 Class_9_25 18.2
8 7 Class_25_50 33.1
9 10 Class_3_9 5.22
10 10 Class_9_25 13.6
11 10 Class_25_50 38.9
12 10 Class_50 110.
13 1066 Class_0_1 0.111
14 1067 Class_0_1 0.436
I'd like to fill each absent intermediate ClassType within a nest with the last available mean value. For nest 3, for example, the desired output is:
nest ClassType avg
<int> <fct> <dbl>
...
3 Class_0_1 0.776
3 Class_1_3 0.776
3 Class_3_9 0.776
3 Class_9_25 12.4
3 Class_25_50 29.4
...
You can use complete and fill, both from tidyr:
library(tidyr)
my.ds %>%
group_by(nest,ClassType)%>%
summarize(avg=mean(AT)) %>%
complete(ClassType, fill = list(avg = NA)) %>%
fill(avg, .direction = "downup")
nest ClassType avg
<int> <fct> <dbl>
1 2 Class_0_1 19.0
2 2 Class_1_3 19.0
3 2 Class_3_9 19.0
4 2 Class_9_25 19.0
5 2 Class_25_50 19.0
6 2 Class_50 19.0
7 3 Class_0_1 0.776
8 3 Class_1_3 0.776
9 3 Class_3_9 0.776
10 3 Class_9_25 12.4
# … with 32 more rows
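The answer above uses .direction = "downup", which also back-fills classes that come before the first observed one (as in nest 2). If only the last observed value should be carried forward, "down" may be closer to what the question asks. The following is a minimal sketch on a made-up toy table (the names toy, lv and filled are assumptions for illustration, not part of the original data):

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: nest 2 has only the last class, nest 3 has a gap
lv <- c("Class_0_1", "Class_1_3", "Class_3_9", "Class_9_25")
toy <- tibble(
  nest = c(2, 3, 3),
  ClassType = factor(c("Class_9_25", "Class_0_1", "Class_9_25"), levels = lv),
  avg = c(19.0, 0.776, 12.4)
)

filled <- toy %>%
  group_by(nest) %>%
  complete(ClassType) %>%           # add a row for every factor level per nest
  fill(avg, .direction = "down") %>% # carry the last observed mean forward only
  ungroup()

print(filled)
```

With "down", the leading classes of nest 2 stay NA because there is no earlier value to carry forward; switching to "downup" would fill them from the first observed value instead.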
I'm trying to run a repeated measures ANCOVA. The following code works fine:
tidy(aov(FA ~ sex * study + Error(PartID), data = DTI.TRACTlong))
Here FA is a continuous measure, sex and study are factors (study indicates time 1 or time 2), and PartID is the individual ID. However, I have to run this analysis for a number of regions (ROI) under two different conditions (harmonized vs. not). This seems easy enough using tidyverse with group_by (see below), but it throws: Error in Error(.$PartID) : could not find function "Error". Any idea what is happening here? Why is the Error function recognized when aov is called on its own but not when using tidyverse with group_by?
group_by(harmonize,ROI) %>%
do(tidy(aov(.$FA ~ .$GOBS_Gender * .$study + Error(.$PartID))))
Pass the data through the data argument and remove the .$ prefixes from the formula:
library(broom)
library(dplyr)
...
group_by(harmonize,ROI) %>%
do(tidy(aov(FA ~ sex * study + Error(PartID), data = .)))
Another option is to use nest_by:
library(tidyr)
...
nest_by(harmonize, ROI) %>%
mutate(out = list(tidy(aov(FA ~ sex * study + Error(PartID), data = data)))) %>%
select(-data) %>%
unnest(c(out))
Using a reproducible example
> data(npk)
> npk$grp <- rep(c('a', 'b'), each = 12)
-do method
> npk %>%
group_by(grp) %>%
do(tidy(aov(yield ~ N*P*K + Error(block), data = .)))
# A tibble: 18 x 8
# Groups: grp [2]
grp stratum term df sumsq meansq statistic p.value
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 a block N:P:K 1 69.0 69.0 3.12 0.328
2 a block Residuals 1 22.1 22.1 NA NA
3 a Within N 1 119. 119. 6.89 0.0786
4 a Within P 1 0.270 0.270 0.0156 0.908
5 a Within K 1 58.1 58.1 3.36 0.164
6 a Within N:P 1 12.2 12.2 0.705 0.463
7 a Within N:K 1 23.4 23.4 1.35 0.329
8 a Within P:K 1 44.6 44.6 2.58 0.207
9 a Within Residuals 3 51.8 17.3 NA NA
10 b block N:P:K 1 29.3 29.3 0.431 0.630
11 b block Residuals 1 67.9 67.9 NA NA
12 b Within N 1 73.0 73.0 57.3 0.00478
13 b Within P 1 21.3 21.3 16.7 0.0264
14 b Within K 1 38.2 38.2 29.9 0.0120
15 b Within N:P 1 8.52 8.52 6.68 0.0814
16 b Within N:K 1 31.5 31.5 24.7 0.0156
17 b Within P:K 1 47.3 47.3 37.1 0.00888
18 b Within Residuals 3 3.82 1.27 NA NA
-group_modify method (suggested by @andrew reece)
> npk %>%
group_by(grp) %>%
group_modify(~aov(yield ~ N*P*K + Error(block), data = .) %>%
tidy)
# A tibble: 18 x 8
# Groups: grp [2]
grp stratum term df sumsq meansq statistic p.value
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 a block N:P:K 1 69.0 69.0 3.12 0.328
2 a block Residuals 1 22.1 22.1 NA NA
3 a Within N 1 119. 119. 6.89 0.0786
4 a Within P 1 0.270 0.270 0.0156 0.908
5 a Within K 1 58.1 58.1 3.36 0.164
6 a Within N:P 1 12.2 12.2 0.705 0.463
7 a Within N:K 1 23.4 23.4 1.35 0.329
8 a Within P:K 1 44.6 44.6 2.58 0.207
9 a Within Residuals 3 51.8 17.3 NA NA
10 b block N:P:K 1 29.3 29.3 0.431 0.630
11 b block Residuals 1 67.9 67.9 NA NA
12 b Within N 1 73.0 73.0 57.3 0.00478
13 b Within P 1 21.3 21.3 16.7 0.0264
14 b Within K 1 38.2 38.2 29.9 0.0120
15 b Within N:P 1 8.52 8.52 6.68 0.0814
16 b Within N:K 1 31.5 31.5 24.7 0.0156
17 b Within P:K 1 47.3 47.3 37.1 0.00888
18 b Within Residuals 3 3.82 1.27 NA NA
-nest_by method
> npk %>%
nest_by(grp) %>%
mutate(out = list(tidy(aov(yield ~ N*P*K + Error(block),
data = data)))) %>%
select(-data) %>%
unnest(out)
# A tibble: 18 x 8
# Groups: grp [2]
grp stratum term df sumsq meansq statistic p.value
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 a block N:P:K 1 69.0 69.0 3.12 0.328
2 a block Residuals 1 22.1 22.1 NA NA
3 a Within N 1 119. 119. 6.89 0.0786
4 a Within P 1 0.270 0.270 0.0156 0.908
5 a Within K 1 58.1 58.1 3.36 0.164
6 a Within N:P 1 12.2 12.2 0.705 0.463
7 a Within N:K 1 23.4 23.4 1.35 0.329
8 a Within P:K 1 44.6 44.6 2.58 0.207
9 a Within Residuals 3 51.8 17.3 NA NA
10 b block N:P:K 1 29.3 29.3 0.431 0.630
11 b block Residuals 1 67.9 67.9 NA NA
12 b Within N 1 73.0 73.0 57.3 0.00478
13 b Within P 1 21.3 21.3 16.7 0.0264
14 b Within K 1 38.2 38.2 29.9 0.0120
15 b Within N:P 1 8.52 8.52 6.68 0.0814
16 b Within N:K 1 31.5 31.5 24.7 0.0156
17 b Within P:K 1 47.3 47.3 37.1 0.00888
18 b Within Residuals 3 3.82 1.27 NA NA
I have the dataset below:
> head(GLM_df)
# A tibble: 6 x 7
# Groups: hour [6]
hour Feeding Foraging Standing ID Area Feeding_Foraging
<int> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
1 0 3.61 23.2 1 41361 Seronera 26.8
2 1 2.85 24.2 1 41361 Seronera 27.0
3 2 2.5 24.3 2 41361 Seronera 26.8
4 3 6.92 18.6 3.89 41361 Seronera 25.6
5 4 7.5 17.6 3.78 41361 Seronera 25.1
6 5 7.26 19.6 2.45 41361 Seronera 26.8
I would like to round the values in the Standing and Feeding_Foraging columns. I'm using as.integer() as follows:
> GLM_df$Standing<-as.integer(GLM_df$Standing)
> GLM_df$Feeding_Foraging<-as.integer(GLM_df$Feeding_Foraging)
> head(GLM_df)
# A tibble: 6 x 7
# Groups: hour [6]
hour Feeding Foraging Standing ID Area Feeding_Foraging
<int> <dbl> <dbl> <int> <chr> <chr> <int>
1 0 3.61 23.2 1 41361 Seronera 26
2 1 2.85 24.2 1 41361 Seronera 27
3 2 2.5 24.3 2 41361 Seronera 26
4 3 6.92 18.6 3 41361 Seronera 25
5 4 7.5 17.6 3 41361 Seronera 25
6 5 7.26 19.6 2 41361 Seronera 26
However, this truncates 26.8 to 26, whereas I would like 26.8 to be rounded up to 27. Values with a decimal component below 0.5 (e.g. 25.1) should still be rounded down (e.g. to 25), as as.integer() already does.
Is there a function I can use for that or do I need code?
Any input is appreciated!
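One possible approach, assuming round-to-nearest is what is wanted: base R's round() rounds to the nearest whole number, whereas as.integer() simply drops the fractional part. The vector x below is made up from values that appear in the question. Note that round() sends exact halves to the even neighbour (round(2.5) gives 2), so a different rule would be needed if halves must always go up.

```r
# Values taken from the question's columns, for illustration only
x <- c(26.8, 25.1, 3.89, 2.45)

truncated <- as.integer(x)        # drops the decimals: 26.8 becomes 26
rounded <- as.integer(round(x))   # rounds to nearest first, then converts to integer

print(truncated)
print(rounded)
```

Applied to the data in the question, this would look like GLM_df$Standing <- as.integer(round(GLM_df$Standing)), and likewise for Feeding_Foraging.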
I am trying to figure out if I can use the list of arguments provided to purrr::pmap() to also name the elements of the output list from this function using purrr::set_names().
Here is a simple example where I use pmap to create summaries of some variables from different dataframes across grouping variables.
# setup
library(tidyverse)
library(groupedstats)
set.seed(123)
# creating the dataframes
data_1 <- tibble::as_tibble(iris)
data_2 <- tibble::as_tibble(mtcars)
data_3 <- tibble::as_tibble(airquality)
# creating a list
purrr::pmap(
.l = list(
data = list(data_1, data_2, data_3),
grouping.vars = alist(Species, c(am, cyl), Month),
measures = alist(c(Sepal.Length, Sepal.Width), wt, c(Ozone, Solar.R, Wind))
),
.f = groupedstats::grouped_summary
) %>% # assigning names to each element of the list
purrr::set_names(x = ., nm = alist(data_1, data_2, data_3))
# output
#> $data_1
#> # A tibble: 6 x 16
#> Species type variable missing complete n mean sd min p25
#> <fct> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa nume~ Sepal.L~ 0 50 50 5.01 0.35 4.3 4.8
#> 2 setosa nume~ Sepal.W~ 0 50 50 3.43 0.38 2.3 3.2
#> 3 versic~ nume~ Sepal.L~ 0 50 50 5.94 0.52 4.9 5.6
#> 4 versic~ nume~ Sepal.W~ 0 50 50 2.77 0.31 2 2.52
#> 5 virgin~ nume~ Sepal.L~ 0 50 50 6.59 0.64 4.9 6.23
#> 6 virgin~ nume~ Sepal.W~ 0 50 50 2.97 0.32 2.2 2.8
#> # ... with 6 more variables: median <dbl>, p75 <dbl>, max <dbl>,
#> # std.error <dbl>, mean.low.conf <dbl>, mean.high.conf <dbl>
#>
#> $data_2
#> # A tibble: 6 x 17
#> am cyl type variable missing complete n mean sd min p25
#> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 6 nume~ wt 0 3 3 2.75 0.13 2.62 2.7
#> 2 1 4 nume~ wt 0 8 8 2.04 0.41 1.51 1.78
#> 3 0 6 nume~ wt 0 4 4 3.39 0.12 3.21 3.38
#> 4 0 8 nume~ wt 0 12 12 4.1 0.77 3.44 3.56
#> 5 0 4 nume~ wt 0 3 3 2.94 0.41 2.46 2.81
#> 6 1 8 nume~ wt 0 2 2 3.37 0.28 3.17 3.27
#> # ... with 6 more variables: median <dbl>, p75 <dbl>, max <dbl>,
#> # std.error <dbl>, mean.low.conf <dbl>, mean.high.conf <dbl>
#>
#> $data_3
#> # A tibble: 15 x 16
#> Month type variable missing complete n mean sd min p25
#> <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 5 inte~ Ozone 5 26 31 23.6 22.2 1 11
#> 2 5 inte~ Solar.R 4 27 31 181. 115. 8 72
#> 3 5 nume~ Wind 0 31 31 11.6 3.53 5.7 8.9
#> 4 6 inte~ Ozone 21 9 30 29.4 18.2 12 20
#> 5 6 inte~ Solar.R 0 30 30 190. 92.9 31 127
#> 6 6 nume~ Wind 0 30 30 10.3 3.77 1.7 8
#> 7 7 inte~ Ozone 5 26 31 59.1 31.6 7 36.2
#> 8 7 inte~ Solar.R 0 31 31 216. 80.6 7 175
#> 9 7 nume~ Wind 0 31 31 8.94 3.04 4.1 6.9
#> 10 8 inte~ Ozone 5 26 31 60.0 39.7 9 28.8
#> 11 8 inte~ Solar.R 3 28 31 172. 76.8 24 107
#> 12 8 nume~ Wind 0 31 31 8.79 3.23 2.3 6.6
#> 13 9 inte~ Ozone 1 29 30 31.4 24.1 7 16
#> 14 9 inte~ Solar.R 0 30 30 167. 79.1 14 117.
#> 15 9 nume~ Wind 0 30 30 10.2 3.46 2.8 7.55
#> # ... with 6 more variables: median <dbl>, p75 <dbl>, max <dbl>,
#> # std.error <dbl>, mean.low.conf <dbl>, mean.high.conf <dbl>
Created on 2018-10-31 by the reprex package (v0.2.1)
As can be seen here, the contents of data argument to purrr::pmap and nm argument in purrr::set_names are exactly identical ((data_1, data_2, data_3)). I want to avoid this repetition (which seems unnecessary here with 3 elements, but I have a much bigger list of arguments). I can't assign this list to a separate object because in one case it is a list, while the other one is entered as alist.
How can I do this?
From the tidyverse, you can also use the lst function. lst creates a list in the same way that tibble creates a tibble. One difference from base list() is that it automatically names the elements after the objects passed to it.
It is available from dplyr, which re-exports it from tibble.
In the example I also replace base alist with rlang::exprs, which is equivalent here. Either one works.
library(tidyverse)
library(groupedstats)
set.seed(123)
# creating the dataframes
data_1 <- tibble::as_tibble(iris)
data_2 <- tibble::as_tibble(mtcars)
data_3 <- tibble::as_tibble(airquality)
# creating a list
purrr::pmap(
.l = list(
data = lst(data_1, data_2, data_3),
grouping.vars = rlang::exprs(Species, c(am, cyl), Month),
measures = rlang::exprs(c(Sepal.Length, Sepal.Width), wt, c(Ozone, Solar.R, Wind))
),
.f = groupedstats::grouped_summary
) %>%
str(1)
#> List of 3
#> $ data_1:Classes 'tbl_df', 'tbl' and 'data.frame': 6 obs. of 16 variables:
#> $ data_2:Classes 'tbl_df', 'tbl' and 'data.frame': 6 obs. of 17 variables:
#> $ data_3:Classes 'tbl_df', 'tbl' and 'data.frame': 15 obs. of 16 variables:
Created on 2018-11-02 by the reprex package (v0.2.1)
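To see the automatic naming in isolation, here is a small sketch comparing base list() with tibble::lst(). The objects data_a and data_b are made up for illustration and are not part of the original example.

```r
library(tibble)

# Hypothetical small data frames, only to show how names are derived
data_a <- head(iris)
data_b <- head(mtcars)

unnamed <- list(data_a, data_b)  # base list(): elements get no names
named <- lst(data_a, data_b)     # lst(): names are taken from the input expressions

print(names(unnamed))
print(names(named))
```

Because the list passed as data already carries names, the pmap call above returns a named list without a separate set_names step.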
When I've grouped my data by certain attributes, I want to add a "grand total" line that gives a baseline of comparison. Let's group mtcars by cylinders and carburetors, for example:
by_cyl_carb <- mtcars %>%
group_by(cyl, carb) %>%
summarize(median_mpg = median(mpg),
avg_mpg = mean(mpg),
count = n())
...yields these results:
> by_cyl_carb
# A tibble: 9 x 5
# Groups: cyl [?]
cyl carb median_mpg avg_mpg count
<dbl> <dbl> <dbl> <dbl> <int>
1 4 1 27.3 27.6 5
2 4 2 25.2 25.9 6
3 6 1 19.8 19.8 2
4 6 4 20.1 19.8 4
5 6 6 19.7 19.7 1
6 8 2 17.1 17.2 4
7 8 3 16.4 16.3 3
8 8 4 13.8 13.2 6
9 8 8 15 15 1
What is the code I need to make it provide a baseline or grand total that would sum (or mean or median) over all of the data? The desired data would be something like this:
cyl carb median_mpg avg_mpg count
<chr> <chr> <dbl> <dbl> <int>
1 4 1 27.3 27.6 5
2 4 2 25.2 25.9 6
3 6 1 19.8 19.8 2
4 6 4 20.1 19.8 4
5 6 6 19.7 19.7 1
6 8 2 17.1 17.2 4
7 8 3 16.4 16.3 3
8 8 4 13.8 13.2 6
9 8 8 15 15 1
10 ttl ttl 19.2 20.1 32
A twist on this would be able to manipulate the output so that the sub-grouped data would be rolled up. For example:
11 ttl 1 13.8 13.2 6
12 ttl 2 15 15 1
13 ttl 3 19.3 20.4 32
14 ... etc ...
The real-life example I am using this for is median sale price of homes by geography by year. Hence I want to report out the median sale price for each geography-year I'm interested in, but I want a baseline comparison for each year regardless of geography.
Edit: Solved with two solutions
@camille referenced a link that solved the problem, and @MKR also offered a solution. Here is one version that works:
by_cyl_carb <- mtcars %>%
mutate_at(vars(cyl, carb), as.character) %>%
bind_rows(mutate(., cyl = "All cylinders")) %>%
bind_rows(mutate(., carb = "All carburetors")) %>%
group_by(cyl, carb) %>%
summarize(median_mpg = median(mpg),
avg_mpg = mean(mpg),
count = n())
> by_cyl_carb
# A tibble: 19 x 5
# Groups: cyl [?]
cyl carb median_mpg avg_mpg count
<chr> <chr> <dbl> <dbl> <int>
1 4 1 27.3 27.6 5
2 4 2 25.2 25.9 6
3 4 All carburetors 26 26.7 11
4 6 1 19.8 19.8 2
5 6 4 20.1 19.8 4
6 6 6 19.7 19.7 1
7 6 All carburetors 19.7 19.7 7
8 8 2 17.1 17.2 4
9 8 3 16.4 16.3 3
10 8 4 13.8 13.2 6
11 8 8 15 15 1
12 8 All carburetors 15.2 15.1 14
13 All cylinders 1 22.8 25.3 7
14 All cylinders 2 22.1 22.4 10
15 All cylinders 3 16.4 16.3 3
16 All cylinders 4 15.2 15.8 10
17 All cylinders 6 19.7 19.7 1
18 All cylinders 8 15 15 1
19 All cylinders All carburetors 19.2 20.1 32
A solution using dplyr::bind_rows and mutate_at can be achieved as:
library(tidyverse)
mtcars %>%
group_by(cyl, carb) %>%
summarize(median_mpg = median(mpg),
avg_mpg = mean(mpg),
count = n()) %>%
ungroup() %>%
mutate_at(vars(cyl:carb), as.character) %>%
bind_rows(summarise(mtcars, cyl = "ttl", carb = "ttl",
median_mpg = median(mpg),
avg_mpg = mean(mpg),
count = n()))
# # A tibble: 10 x 5
# cyl carb median_mpg avg_mpg count
# <chr> <chr> <dbl> <dbl> <int>
# 1 4 1 27.3 27.6 5
# 2 4 2 25.2 25.9 6
# 3 6 1 19.8 19.8 2
# 4 6 4 20.1 19.8 4
# 5 6 6 19.7 19.7 1
# 6 8 2 17.1 17.2 4
# 7 8 3 16.4 16.3 3
# 8 8 4 13.8 13.2 6
# 9 8 8 15.0 15.0 1
#10 ttl ttl 19.2 20.1 32
I have a dataframe containing multiple entries per week. It looks like this:
Week t_10 t_15 t_18 t_20 t_25 t_30
1 51.4 37.8 25.6 19.7 11.9 5.6
2 51.9 37.8 25.8 20.4 12.3 6.2
2 52.4 38.5 26.2 20.5 12.3 6.1
3 52.2 38.6 26.1 20.4 12.4 5.9
4 52.2 38.3 26.1 20.2 12.1 5.9
4 52.7 38.4 25.8 20.0 12.1 5.9
4 51.1 37.8 25.7 20.0 12.2 6.0
4 51.9 38.0 26.0 19.8 12.0 5.8
The weeks have different numbers of entries, ranging from one to four per week.
I want to calculate the median for each week for all the variables (t_10 through t_30) and output the result in a new dataframe. NA cells are already omitted in the original dataframe. I have tried different approaches with the ddply function of the plyr package, but to no avail so far.
We could use summarise_at for multiple columns:
library(dplyr)
colsToKeep <- c("t_10", "t_30")
df1 %>%
group_by(Week) %>%
summarise_at(colsToKeep, median)
# A tibble: 4 x 3
# Week t_10 t_30
# <int> <dbl> <dbl>
#1 1 51.40 5.60
#2 2 52.15 6.15
#3 3 52.20 5.90
#4 4 52.05 5.90
Specify the variables to keep in colsToKeep and store the input table in d:
library(tidyverse)
colsToKeep <- c("t_10", "t_30")
gather(d, variable, value, -Week) %>%
filter(variable %in% colsToKeep) %>%
group_by(Week, variable) %>%
summarise(median = median(value))
# A tibble: 8 x 3
# Groups: Week [4]
Week variable median
<int> <chr> <dbl>
1 1 t_10 51.40
2 1 t_30 5.60
3 2 t_10 52.15
4 2 t_30 6.15
5 3 t_10 52.20
6 3 t_30 5.90
7 4 t_10 52.05
8 4 t_30 5.90
You can also use the aggregate function:
newdf <- aggregate(. ~ Week, data = df, FUN = median)
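As a self-contained sketch of the aggregate approach, the block below rebuilds two of the question's columns (t_10 and t_30) by hand and computes per-week medians in base R. The formula . ~ Week reads "every other column, grouped by Week". The data frame name df is an assumption, as in the answer.

```r
# Two columns copied from the table in the question
df <- data.frame(
  Week = c(1, 2, 2, 3, 4, 4, 4, 4),
  t_10 = c(51.4, 51.9, 52.4, 52.2, 52.2, 52.7, 51.1, 51.9),
  t_30 = c(5.6, 6.2, 6.1, 5.9, 5.9, 5.9, 6.0, 5.8)
)

# Left-hand side: columns to summarise; right-hand side: grouping column
newdf <- aggregate(. ~ Week, data = df, FUN = median)

print(newdf)
```

The result has one row per week, and its values should agree with the summarise_at output shown earlier for the same two columns.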