R, 3-way table, how to order

I am trying to order a table that has 3 variables, commonly known as a 3-way table.
I have attached a picture of the structure of the table the replicable code will produce.
Is it possible to order this table in a logical way, despite the fact it is essentially split into three sections/groups? For instance, could you order by the column "No" or the column "Yes" based on the values?
For example, when ordering by "No", England would be ordered as setosa (7), virginica (8), versicolor (16). Wales would be ordered versicolor (11), setosa (12), virginica... and so on for each section of the table.
#Replicable code using the Iris data built into R:
Data <- iris
Data $ var2 <- Data $ Species
Data $ var2 <- sample(Data $ var2)
Data $ var3 <- Data $ Species
Data $ var3 <- sample(Data $ var3)
#making the example clearer
library(plyr)
Data $ var2 <- revalue(Data $ var2, c("setosa"="No", "versicolor"="No","virginica" ="Yes"))
Data $ var3 <- revalue(Data $ var3, c("setosa"="England", "versicolor"="Wales","virginica" ="Scotland"))
#3-way Table:
df <- table(Data $ Species, Data $ var2, Data $ var3)
df
Kind Regards, James Prentice, a person trying to get to grips with R.

I would avoid table() and array() for this, as multi-way tables are hard to reorder. I also recommend learning dplyr rather than plyr, as plyr has been retired and is no longer actively developed.
Instead of using table(), work directly with the original data frame:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
Data <- iris
Data$Living <- sample(c("No", "Yes"), size = nrow(Data), replace = TRUE)
Data$Country <- sample(c("England", "Wales", "Scotland"), size = nrow(Data), replace = TRUE)
# Results in one data frame
Data %>%
group_by(Country, Species) %>%
summarize(Yes = sum(Living == "Yes"), No = sum(Living == "No")) %>%
ungroup() %>%
arrange(Country, Yes)
#> `summarise()` has grouped output by 'Country'. You can override using the `.groups` argument.
#> # A tibble: 9 x 4
#> Country Species Yes No
#> <chr> <fct> <int> <int>
#> 1 England virginica 2 8
#> 2 England versicolor 7 15
#> 3 England setosa 14 5
#> 4 Scotland setosa 5 14
#> 5 Scotland virginica 6 12
#> 6 Scotland versicolor 9 8
#> 7 Wales setosa 4 8
#> 8 Wales versicolor 5 6
#> 9 Wales virginica 14 8
# Results in a list of data frames
Data %>%
group_by(Country, Species) %>%
summarize(Yes = sum(Living == "Yes"), No = sum(Living == "No")) %>%
ungroup() %>%
arrange(Country, Yes) %>%
split(., .$Country)
#> `summarise()` has grouped output by 'Country'. You can override using the `.groups` argument.
#> $England
#> # A tibble: 3 x 4
#> Country Species Yes No
#> <chr> <fct> <int> <int>
#> 1 England virginica 2 8
#> 2 England versicolor 7 15
#> 3 England setosa 14 5
#>
#> $Scotland
#> # A tibble: 3 x 4
#> Country Species Yes No
#> <chr> <fct> <int> <int>
#> 1 Scotland setosa 5 14
#> 2 Scotland virginica 6 12
#> 3 Scotland versicolor 9 8
#>
#> $Wales
#> # A tibble: 3 x 4
#> Country Species Yes No
#> <chr> <fct> <int> <int>
#> 1 Wales setosa 4 8
#> 2 Wales versicolor 5 6
#> 3 Wales virginica 14 8
Created on 2021-06-01 by the reprex package (v2.0.0)
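The question asked about ordering by "No" within each country, while the answer above sorts by Yes. A minimal sketch of the same pipeline sorted by No follows. It assumes dplyr 1.0 or later (for the .groups argument, which also silences the grouping message), and the fixed seed is an arbitrary choice made only so that the random columns are reproducible.

```r
library(dplyr)

set.seed(42)  # assumption: any fixed seed, only for reproducibility
Data <- iris
Data$Living <- sample(c("No", "Yes"), size = nrow(Data), replace = TRUE)
Data$Country <- sample(c("England", "Wales", "Scotland"), size = nrow(Data), replace = TRUE)

by_no <- Data %>%
  group_by(Country, Species) %>%
  # .groups = "drop" returns an ungrouped tibble, so ungroup() is not needed
  summarise(Yes = sum(Living == "Yes"), No = sum(Living == "No"), .groups = "drop") %>%
  # sort countries alphabetically, then species by ascending "No" count
  arrange(Country, No)

by_no
```

Replacing No with desc(No) in arrange() would sort each country's block from largest to smallest instead.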

R calculate most abundant taxa using phyloseq object

I would like to know whether my approach to calculating the average relative abundance of a taxon is correct.
Specifically, is the following a correct way to calculate the relative abundance (percent) of each family (or any taxon) in a phyloseq object (GlobalPatterns)?
data("GlobalPatterns")
T <- GlobalPatterns %>%
tax_glom(., "Family") %>%
transform_sample_counts(function(x)100* x / sum(x)) %>% psmelt() %>%
arrange(OTU) %>% rename(OTUsID = OTU) %>%
select(OTUsID, Family, Sample, Abundance) %>%
spread(Sample, Abundance)
T$Mean <- rowMeans(T[, c(3:ncol(T))])
FAM <- T[, c("Family", "Mean" ) ]
#order data frame
FAM <- FAM[order(dplyr::desc(FAM$Mean)),]
rownames(FAM) <- NULL
head(FAM)
Family Mean
1 Bacteroidaceae 7.490944
2 Ruminococcaceae 6.038956
3 Lachnospiraceae 5.758200
4 Flavobacteriaceae 5.016402
5 Desulfobulbaceae 3.341026
6 ACK-M1 3.242808
In this case, Bacteroidaceae was the most abundant family across all samples of GlobalPatterns (26 samples and 19216 OTUs): on average, it made up 7.49% of a sample.
Is it correct to use T$Mean <- rowMeans(T[, c(3:ncol(T))]) to calculate the average abundance of any given taxon?
Bacteroidaceae has the highest mean relative abundance when averaged across all samples.
However, it is the single most abundant family in only 2 samples.
Nevertheless, no other taxon has a higher abundance in an average sample.
Let's use dplyr verbs for all the steps to get more descriptive and consistent code:
library(tidyverse)
library(phyloseq)
#> Creating a generic function for 'nrow' from package 'base' in package 'biomformat'
#> Creating a generic function for 'ncol' from package 'base' in package 'biomformat'
#> Creating a generic function for 'rownames' from package 'base' in package 'biomformat'
#> Creating a generic function for 'colnames' from package 'base' in package 'biomformat'
data(GlobalPatterns)
data <-
GlobalPatterns %>%
tax_glom("Family") %>%
transform_sample_counts(function(x)100* x / sum(x)) %>%
psmelt() %>%
as_tibble()
# highest mean relative abundance: averaged across all samples
data %>%
group_by(Family) %>%
summarise(Abundance = mean(Abundance)) %>%
arrange(-Abundance)
#> # A tibble: 334 × 2
#> Family Abundance
#> <chr> <dbl>
#> 1 Bacteroidaceae 7.49
#> 2 Ruminococcaceae 6.04
#> 3 Lachnospiraceae 5.76
#> 4 Flavobacteriaceae 5.02
#> 5 Desulfobulbaceae 3.34
#> 6 ACK-M1 3.24
#> 7 Streptococcaceae 2.77
#> 8 Nostocaceae 2.62
#> 9 Enterobacteriaceae 2.55
#> 10 Spartobacteriaceae 2.45
#> # … with 324 more rows
# sanity check: is total abundance of each sample 100%?
data %>%
group_by(Sample) %>%
summarise(Abundance = sum(Abundance)) %>%
pull(Abundance) %>%
`==`(100) %>%
all()
#> [1] TRUE
# get most abundant family for each sample individually
data %>%
group_by(Sample) %>%
arrange(-Abundance) %>%
slice(1) %>%
select(Family) %>%
ungroup() %>%
count(Family, name = "n_samples") %>%
arrange(-n_samples)
#> Adding missing grouping variables: `Sample`
#> # A tibble: 18 × 2
#> Family n_samples
#> <chr> <int>
#> 1 Desulfobulbaceae 3
#> 2 Bacteroidaceae 2
#> 3 Crenotrichaceae 2
#> 4 Flavobacteriaceae 2
#> 5 Lachnospiraceae 2
#> 6 Ruminococcaceae 2
#> 7 Streptococcaceae 2
#> 8 ACK-M1 1
#> 9 Enterobacteriaceae 1
#> 10 Moraxellaceae 1
#> 11 Neisseriaceae 1
#> 12 Nostocaceae 1
#> 13 Solibacteraceae 1
#> 14 Spartobacteriaceae 1
#> 15 Sphingomonadaceae 1
#> 16 Synechococcaceae 1
#> 17 Veillonellaceae 1
#> 18 Verrucomicrobiaceae 1
Created on 2022-06-10 by the reprex package (v2.0.0)
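To check the asker's rowMeans() approach without needing phyloseq, here is a small base R sketch on made-up counts (the family and sample names are hypothetical). It shows that rowMeans() on the wide percent table and a per-family mean on the long table give the same numbers. It also shows that this mean of per-sample percentages can differ from the percentage obtained by pooling the raw counts of all samples.

```r
# Hypothetical counts: 3 families (rows) in 3 samples (columns)
counts <- matrix(
  c(5, 15, 30,      # sample S1
    100, 50, 50,    # sample S2
    0, 2, 8),       # sample S3
  nrow = 3,
  dimnames = list(Family = c("FamA", "FamB", "FamC"),
                  Sample = c("S1", "S2", "S3"))
)

# Relative abundance in percent, computed within each sample (column)
rel <- sweep(counts, 2, colSums(counts), "/") * 100

# Wide approach (as in the question): mean across sample columns
wide_mean <- rowMeans(rel)

# Long approach (as in the answer): melt, then average per family
long <- as.data.frame(as.table(rel), responseName = "Abundance")
long_mean <- tapply(long$Abundance, long$Family, mean)

# Pooling raw counts weights deep samples more heavily
pooled <- rowSums(counts) / sum(counts) * 100

wide_mean
long_mean
pooled
```

In this toy data FamC has the highest mean relative abundance, while FamA dominates the pooled counts because sample S2 has far more reads than the others.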

How to convert normalized numeric variable (library(recipes)) back to original value in R

I normalized the numeric variables with the recipes package in R before fitting decision tree models to predict an outcome. Now I have a decision tree in which age is one of the important variables, with a split such as > -1.5 and <= -1.5. I want to convert that -1.5 back into a non-normalized value to give it a practical meaning (like age > 50 or <= 50 years old). I have searched and cannot find the answer.
library(recipes)
recipe_obj <- dataset %>%
recipe(formula = anyaki ~.) %>% #specify formula
step_center(all_numeric()) %>% #center data (0 mean)
step_scale(all_numeric()) %>% #std = 1
prep(data = dataset)
dataset_scaled <- bake(recipe_obj, new_data = dataset)
Age is one of the variables that was normalized by the recipes package. How can I convert the normalized values in the final model back into non-normalized values so that they have a practical meaning?
You can access these kinds of estimated values using the tidy() method for recipes and recipe steps. Check out more details here and here.
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
data(penguins)
penguin_rec <- recipe(~ ., data = penguins) %>%
step_other(all_nominal(), threshold = 0.2, other = "another") %>%
step_normalize(all_numeric()) %>%
step_dummy(all_nominal())
tidy(penguin_rec)
#> # A tibble: 3 × 6
#> number operation type trained skip id
#> <int> <chr> <chr> <lgl> <lgl> <chr>
#> 1 1 step other FALSE FALSE other_ZNJ2R
#> 2 2 step normalize FALSE FALSE normalize_ogEvZ
#> 3 3 step dummy FALSE FALSE dummy_YVCBo
tidy(penguin_rec, number = 1)
#> # A tibble: 1 × 3
#> terms retained id
#> <chr> <chr> <chr>
#> 1 all_nominal() <NA> other_ZNJ2R
penguin_prepped <- prep(penguin_rec, training = penguins)
#> Warning: There are new levels in a factor: NA
tidy(penguin_prepped)
#> # A tibble: 3 × 6
#> number operation type trained skip id
#> <int> <chr> <chr> <lgl> <lgl> <chr>
#> 1 1 step other TRUE FALSE other_ZNJ2R
#> 2 2 step normalize TRUE FALSE normalize_ogEvZ
#> 3 3 step dummy TRUE FALSE dummy_YVCBo
tidy(penguin_prepped, number = 1)
#> # A tibble: 6 × 3
#> terms retained id
#> <chr> <chr> <chr>
#> 1 species Adelie other_ZNJ2R
#> 2 species Gentoo other_ZNJ2R
#> 3 island Biscoe other_ZNJ2R
#> 4 island Dream other_ZNJ2R
#> 5 sex female other_ZNJ2R
#> 6 sex male other_ZNJ2R
tidy(penguin_prepped, number = 2)
#> # A tibble: 8 × 4
#> terms statistic value id
#> <chr> <chr> <dbl> <chr>
#> 1 bill_length_mm mean 43.9 normalize_ogEvZ
#> 2 bill_depth_mm mean 17.2 normalize_ogEvZ
#> 3 flipper_length_mm mean 201. normalize_ogEvZ
#> 4 body_mass_g mean 4202. normalize_ogEvZ
#> 5 bill_length_mm sd 5.46 normalize_ogEvZ
#> 6 bill_depth_mm sd 1.97 normalize_ogEvZ
#> 7 flipper_length_mm sd 14.1 normalize_ogEvZ
#> 8 body_mass_g sd 802. normalize_ogEvZ
Created on 2021-08-07 by the reprex package (v2.0.0)
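Once the means and standard deviations are available, the normalization (x - mean) / sd can be undone with z * sd + mean. The base R sketch below assumes a data frame shaped like the tidy(penguin_prepped, number = 2) output above, with values copied (rounded) from that output. The helper unnormalize() and the split value -1.5 are hypothetical and used only for illustration.

```r
# Stand-in for tidy(penguin_prepped, number = 2): same columns,
# values copied (rounded) from the output above
norm_stats <- data.frame(
  terms = rep(c("bill_length_mm", "body_mass_g"), times = 2),
  statistic = rep(c("mean", "sd"), each = 2),
  value = c(43.9, 4202, 5.46, 802)
)

# Hypothetical helper: reverse (x - mean) / sd for one variable
unnormalize <- function(z, variable, stats) {
  rows <- stats[stats$terms == variable, ]
  m <- rows$value[rows$statistic == "mean"]
  s <- rows$value[rows$statistic == "sd"]
  z * s + m
}

# A normalized split at -1.5 on bill_length_mm, back on the original scale
unnormalize(-1.5, "bill_length_mm", norm_stats)
```

With a real recipe, the stand-in data frame would be replaced by the tidy() result for the normalization step, and the variable name by the predictor used in the tree split (age in the question).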

Is there a way to summarise if column value is x?

I am trying to make a data.frame which displays the average time an individual displays a behaviour.
I have been using group_by and summarise to calculate the averages across groups. But the output is in long format, with one row per combination of groups. See an example using the iris dataset...
data(iris)
x <- iris %>%
group_by(Species, Petal.Length) %>%
summarise(mean(Sepal.Length))
I would like to get an output that has, for this example, one row per 'Species' and a column of averages per 'Petal.Length'.
I have resorted to creating multiple outputs and then using left_join to combine them into the desired data.frame. See example below...
a <- iris %>%
group_by(Species) %>%
filter(Petal.Length == 0.1) %>%
summarise(mean(Sepal.Length))
b <- iris %>%
group_by(Species) %>%
filter(Petal.Length == 0.2) %>%
summarise(mean(Sepal.Length))
left_join(a, b)
However, doing this twelve or more times at a time is tedious and I am sure there must be an easy way to get the mean(Sepal.Length) for the 'Petal.Length' 0.1, and 0.2, and 0.3 (etc) in the one output.
n.b. in my data Petal.Length would actually be characters that represent behaviours and Sepal.Length would be the duration of time
Some ideas:
library(tidyverse)
data(iris)
mutate(iris, Petal.Length_discrete = cut(Petal.Length, 5)) %>%
group_by(Species, Petal.Length_discrete) %>%
summarise(mean(Sepal.Length))
#> `summarise()` has grouped output by 'Species'. You can override using the `.groups` argument.
#> # A tibble: 7 x 3
#> # Groups: Species [3]
#> Species Petal.Length_discrete `mean(Sepal.Length)`
#> <fct> <fct> <dbl>
#> 1 setosa (0.994,2.18] 5.01
#> 2 versicolor (2.18,3.36] 5
#> 3 versicolor (3.36,4.54] 5.81
#> 4 versicolor (4.54,5.72] 6.43
#> 5 virginica (3.36,4.54] 4.9
#> 6 virginica (4.54,5.72] 6.32
#> 7 virginica (5.72,6.91] 7.25
iris %>%
group_split(Species, Petal.Length) %>%
map(~ summarise(.x, mean(Sepal.Length))) %>%
head(3)
#> [[1]]
#> # A tibble: 1 x 1
#> `mean(Sepal.Length)`
#> <dbl>
#> 1 4.6
#>
#> [[2]]
#> # A tibble: 1 x 1
#> `mean(Sepal.Length)`
#> <dbl>
#> 1 4.3
#>
#> [[3]]
#> # A tibble: 1 x 1
#> `mean(Sepal.Length)`
#> <dbl>
#> 1 5.4
Created on 2021-06-28 by the reprex package (v2.0.0)
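If the goal is literally one row per Species and one column per Petal.Length value, one option is to summarise in long format and then reshape. This is a sketch, assuming tidyr 1.1 or later for pivot_wider() with names_sort; the column name mean_sepal and the prefix "PL_" are arbitrary choices, not part of the original posts.

```r
library(dplyr)
library(tidyr)

wide <- iris %>%
  group_by(Species, Petal.Length) %>%
  summarise(mean_sepal = mean(Sepal.Length), .groups = "drop") %>%
  # one column per distinct Petal.Length; missing combinations become NA
  pivot_wider(
    names_from = Petal.Length,
    values_from = mean_sepal,
    names_prefix = "PL_",
    names_sort = TRUE
  )

dim(wide)
wide[, 1:5]
```

With character behaviours in place of Petal.Length, as in the asker's real data, the same call should give one readable column per behaviour, and names_prefix could be dropped.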

Run an aov test through a tibble in a tidy way

I want to run a linear regression on a data frame using the same dependent variable. A similar question was solved here. The problem is that aov function to implement ANOVA doesn't accept x and y as arguments (as far as I know). Is there a way to implement the analysis in a tidy way? So far I've tried something like:
library(tidyverse)
iris %>%
as_tibble() %>%
select(Sepal.Length, Species) %>%
mutate(foo_a = as_factor(sample(c("a", "b", "c"), nrow(.), replace = T)),
foo_b = as_factor(sample(c("d", "e", "f"), nrow(.), replace = T))) %>%
map(~aov(Sepal.Length ~ .x, data = .))
Created on 2019-02-12 by the reprex package (v0.2.1)
The desired output is three analyses: Sepal.Length and Species, Sepal.Length and foo_a, and Sepal.Length and foo_b. Is this possible, or am I on the wrong track?
One approach is to make this into a long-shaped data frame, group by the independent variable of interest, and use the "many models" approach. I usually prefer something like this over trying to do tidyeval across multiple columns—it just gives me a clearer sense of what's going on.
To save space, I'm working with iris_foo, which is your data as you created it up through the 2 mutate lines. Putting it into a long format gives you a key of the names of those three columns that will be used as independent variables in each of the aov calls.
library(tidyverse)
iris_foo %>%
gather(key, value, -Sepal.Length)
#> # A tibble: 450 x 3
#> Sepal.Length key value
#> <dbl> <chr> <chr>
#> 1 5.1 Species setosa
#> 2 4.9 Species setosa
#> 3 4.7 Species setosa
#> 4 4.6 Species setosa
#> 5 5 Species setosa
#> 6 5.4 Species setosa
#> 7 4.6 Species setosa
#> 8 5 Species setosa
#> 9 4.4 Species setosa
#> 10 4.9 Species setosa
#> # … with 440 more rows
From there, nest by key and create a new list-column of ANOVA models. This will be a list of aov objects. For simplicity with getting your models back out, you can drop the data column.
aov_models <- iris_foo %>%
gather(key, value, -Sepal.Length) %>%
group_by(key) %>%
nest() %>%
mutate(model = map(data, ~aov(Sepal.Length ~ value, data = .))) %>%
select(-data)
aov_models
#> # A tibble: 3 x 2
#> key model
#> <chr> <list>
#> 1 Species <S3: aov>
#> 2 foo_a <S3: aov>
#> 3 foo_b <S3: aov>
From there, you can work with the models however you like. They're accessible in the list aov_models$model. Printed, they look how you'd expect. For example, the first model:
aov_models$model[[1]]
#> Call:
#> aov(formula = Sepal.Length ~ value, data = .)
#>
#> Terms:
#> value Residuals
#> Sum of Squares 63.21213 38.95620
#> Deg. of Freedom 2 147
#>
#> Residual standard error: 0.5147894
#> Estimated effects may be unbalanced
To see all the models, call aov_models$model %>% map(print). You might also want to use broom functions, such as broom::tidy or broom::glance, depending on how you need to present the models.
aov_models$model %>%
map(broom::tidy)
#> [[1]]
#> # A tibble: 2 x 6
#> term df sumsq meansq statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 value 2 63.2 31.6 119. 1.67e-31
#> 2 Residuals 147 39.0 0.265 NA NA
#>
#> [[2]]
#> # A tibble: 2 x 6
#> term df sumsq meansq statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 value 2 0.281 0.141 0.203 0.817
#> 2 Residuals 147 102. 0.693 NA NA
#>
#> [[3]]
#> # A tibble: 2 x 6
#> term df sumsq meansq statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 value 2 0.756 0.378 0.548 0.579
#> 2 Residuals 147 101. 0.690 NA NA
Or, to tidy all the models into a single data frame that keeps the key column, you could do:
aov_models %>%
mutate(model_tidy = map(model, broom::tidy)) %>%
unnest(model_tidy)
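For comparison, here is a base R sketch of a different technique from the nested "many models" approach above: it builds each formula with reformulate() and loops over the predictor names. The construction of iris_foo is reconstructed from the question's code, and the seed is an arbitrary assumption, so the foo_a and foo_b results will not match the numbers printed above.

```r
set.seed(123)  # assumption: arbitrary seed for the random factors
iris_foo <- iris[, c("Sepal.Length", "Species")]
iris_foo$foo_a <- factor(sample(c("a", "b", "c"), nrow(iris_foo), replace = TRUE))
iris_foo$foo_b <- factor(sample(c("d", "e", "f"), nrow(iris_foo), replace = TRUE))

# Every column except the response is used as a predictor in turn
predictors <- setdiff(names(iris_foo), "Sepal.Length")

# reformulate("Species", response = "Sepal.Length") gives Sepal.Length ~ Species
models <- lapply(predictors, function(v) {
  aov(reformulate(v, response = "Sepal.Length"), data = iris_foo)
})
names(models) <- predictors

summary(models$Species)
```

One difference from the long-format approach is that each model keeps the real predictor name in its formula rather than the generic name value.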

How can I create a copy of a data frame excluding columns of type list? [duplicate]

I have a data.frame with almost 200 variables (columns) and different types of data (num, int, logi, factor). Now, I would like to remove all the variables of type "factor" so that I can run the function cor().
When I use the function str() I can see which variables are of the type "factor", but I don't know how to select and remove all these variables, because removing one by one is time consuming. To select these variables I have tried attr(), and typeof() without results.
Some direction?
Assuming a generic data.frame this will remove columns of type factor
df[,-which(sapply(df, class) == "factor")]
EDIT
As per @Roland's suggestion, you can also just keep those which are not factor. Whichever you prefer.
df[, sapply(df, class) != "factor"]
EDIT 2
As you are concerned with the cor function, @Ista also points out that it would be safer in that particular instance to filter on is.numeric. The above are only to remove factor types.
df[,sapply(df, is.numeric)]
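A short base R sketch of these options on iris, which has four numeric columns and one factor. It also illustrates one reason to prefer the logical-index forms: when no column matches, -which(...) yields an empty index and selects zero columns instead of keeping everything. The object names are arbitrary, and drop = FALSE is added here so that a single remaining column would still come back as a data frame.

```r
# Keep only numeric columns, then compute the correlation matrix
num_only <- iris[, sapply(iris, is.numeric), drop = FALSE]
cor(num_only)

# Pitfall: a data frame that contains no factor columns at all
no_factors <- iris[, 1:4]

# which() returns integer(0), so the negative index selects no columns
dropped_all <- no_factors[, -which(sapply(no_factors, class) == "factor")]
ncol(dropped_all)

# The logical index keeps all four columns, as intended
kept <- no_factors[, sapply(no_factors, class) != "factor", drop = FALSE]
ncol(kept)
```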
Here's a very useful tidyverse solution, adapted from here:
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#>
#> date
library(tidyverse)
# Create dummy dataset with multiple variable types
df <-
tibble::tribble(
~var_num_1, ~var_num_2, ~var_char, ~var_fct, ~var_date,
1, 10, "this", "THIS", "2019-12-18",
2, 20, "is", "IS", "2019-12-19",
3, 30, "dummy", "DUMMY", "2019-12-20",
4, 40, "character", "FACTOR", "2019-12-21",
5, 50, "text", "TEXT", "2019-12-22"
) %>%
mutate(
var_fct = as_factor(var_fct),
var_date = as_date(var_date)
)
# Select numeric variables
df %>% select_if(is.numeric)
#> # A tibble: 5 x 2
#> var_num_1 var_num_2
#> <dbl> <dbl>
#> 1 1 10
#> 2 2 20
#> 3 3 30
#> 4 4 40
#> 5 5 50
# Select character variables
df %>% select_if(is.character)
#> # A tibble: 5 x 1
#> var_char
#> <chr>
#> 1 this
#> 2 is
#> 3 dummy
#> 4 character
#> 5 text
# Select factor variables
df %>% select_if(is.factor)
#> # A tibble: 5 x 1
#> var_fct
#> <fct>
#> 1 THIS
#> 2 IS
#> 3 DUMMY
#> 4 FACTOR
#> 5 TEXT
# Select date variables
df %>% select_if(is.Date)
#> # A tibble: 5 x 1
#> var_date
#> <date>
#> 1 2019-12-18
#> 2 2019-12-19
#> 3 2019-12-20
#> 4 2019-12-21
#> 5 2019-12-22
# Select variables using negation (note the use of `~`)
df %>% select_if(~!is.numeric(.))
#> # A tibble: 5 x 3
#> var_char var_fct var_date
#> <chr> <fct> <date>
#> 1 this THIS 2019-12-18
#> 2 is IS 2019-12-19
#> 3 dummy DUMMY 2019-12-20
#> 4 character FACTOR 2019-12-21
#> 5 text TEXT 2019-12-22
Created on 2019-12-18 by the reprex package (v0.3.0)
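In dplyr 1.0 and later, select_if() is superseded by select() combined with where(). The sketch below shows what I believe are the equivalent calls, using iris rather than the dummy dataset above so that it runs on its own.

```r
library(dplyr)

# Keep numeric columns (equivalent of select_if(is.numeric))
numeric_cols <- iris %>% select(where(is.numeric))

# Keep everything that is not numeric
non_numeric_cols <- iris %>% select(!where(is.numeric))

# Drop factor columns using a formula-style predicate
no_factor_cols <- iris %>% select(where(~ !is.factor(.x)))

names(numeric_cols)
names(non_numeric_cols)
names(no_factor_cols)
```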
