Tidymodels infer chisq_test count column

I am using the infer package to run chisq_test inside a group_by, testing Subgroup ~ Answer.
My data has, among others, a Subgroup column, an Answer column and a Count column.
Is it possible to specify the count column when running
dat <- dat %>%
  group_by(Question, Group) %>%
  mutate(p_value = chisq_test(cur_data(), Subgroup ~ Answer)$p_value) %>%
  ungroup()
or do I need to use uncount(Count) first?
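One way to avoid expanding the rows is to build the contingency table from the counts directly and pass it to base R's chisq.test. This is only a minimal sketch, not infer's own API: I am not aware of a count or weights argument in infer's chisq_test, so treat that as an assumption. The helper count_chisq_p and the toy data are hypothetical; the column names follow the question.

```r
library(dplyr)

# Hypothetical pre-aggregated data: one row per Subgroup/Answer cell with a Count.
dat <- data.frame(
  Question = "Q1",
  Group    = "G1",
  Subgroup = rep(c("a", "b"), each = 2),
  Answer   = rep(c("yes", "no"), times = 2),
  Count    = c(30, 10, 20, 25)
)

# Hypothetical helper: sum the counts into a Subgroup x Answer table
# (cells that never occur are filled with 0) and run base R's chisq.test on it.
count_chisq_p <- function(subgroup, answer, count) {
  tab <- tapply(count, list(subgroup, answer), sum, default = 0)
  chisq.test(tab)$p.value
}

res <- dat %>%
  group_by(Question, Group) %>%
  mutate(p_value = count_chisq_p(Subgroup, Answer, Count)) %>%
  ungroup()
```

Note that the base R result is accessed as $p.value, not $p_value. If you would rather stay with infer, expanding the data with tidyr::uncount(Count) before the grouped chisq_test call should also work.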

Related

Group by, summarise and return the value back to the dataset in R?

I am trying to create summary statistics without losing column values. For example using the iris dataset, I want to group_by the species and find the summary statistics, such as the sd and mean.
Once I have done this, I want to add the result back to the original dataset. How can I do this? I can only do the first step.
library("tidyverse")
data <- iris
data <- data %>%
  group_by(Species) %>%
  summarise(mean.iris = mean(Sepal.Length), sd.iris = sd(Sepal.Length))
This gives one row per species with its mean and sd.
I then want to add the mean and sd to the original iris data, so that I can get the z score for each individual row within its species.
In other words: create groups by species, then find the z score of each individual plant based on its species.

Though there already is an accepted answer, here is a way of computing the Z scores for all numeric variables while keeping the original columns alongside them.
library(dplyr)
iris %>%
  group_by(Species) %>%
  # scale() returns a one-column matrix, so convert it back to a plain vector;
  # .names appends "_Z" instead of overwriting the original columns
  mutate(across(where(is.numeric), ~ as.numeric(scale(.x)), .names = "{.col}_Z")) %>%
  ungroup() %>%
  relocate(Species, .after = last_col())

You can use something like
library("tidyverse")
data <- iris
df <- data %>%
  group_by(Species) %>%
  summarise(mean.iris = mean(Sepal.Length), sd.iris = sd(Sepal.Length))
data %>%
  left_join(df, by = "Species") %>%
  mutate(Z = (Sepal.Length - mean.iris) / sd.iris)

R check for outliers in multiple variables

I need to check my data for outliers and I have 67 different variables, so I don't want to do it by hand. This is my code for checking a single variable by hand (I have three factors to group by: voiceID, gender and VP). But I don't know how to change it into a loop that iterates over columns.
features %>%
  group_by(voiceID, gender, VP) %>%
  identify_outliers(meanF0)
The values are all numbers. The output should tell me which rows for what factors are outliers.
Thanks for help

The output of identify_outliers is a tibble with multiple columns, and it takes a single variable at a time. The variable name can be either quoted or unquoted. So we can group_split the data by the grouping variables, then loop over the columns of interest and apply identify_outliers to each.
library(dplyr)
library(purrr)
library(rstatix)
nm1 <- c("score", "score2")
demo.data %>%
  group_split(gender) %>%
  map(~ map(nm1, function(x) .x %>%
    identify_outliers(x)))
If we want to count the outliers,
features %>%
  group_by(voiceID, gender, VP) %>%
  summarise(across(everything(), ~ length(boxplot(., plot = FALSE)$out)))

How to mutate paneldata with dplyr in R?

I have panel data (person-year combinations) for which I need to investigate the impact that your partner's characteristics (several "x") have on your outcome variable (y). Everything is given in one tibble/dataframe. Partner information is given by "pid".
paneldata = data.frame(id=c(1,1,1,2,2,2,3,3,3,4,4,4), time=seq(1:3), pid=c(3,3,NA,4,4,3,1,1,2,2,2,NA),
  y=c(9,10,11,12,13,14,15,16,17,18,19,20), x=c(21,22,23,24,25,26,27,28,29,30,31,32),
  x_partner=c(27,28,NA,30,31,29,21,22,26,24,25,NA))
library(dplyr)
paneldata %>%
  group_by(id, time) %>%
  mutate(x_pid = x[pid])
I want to achieve x_partner, but what I have so far is x_pid. I'm trying to catch the index: while running through group_by "id" and "time", get the "pid" (not unique!) and look up x at the combination pid-time.

You shouldn't be grouping by id, only by time.
paneldata %>%
  group_by(time) %>%
  mutate(x_partner = x[match(id, pid)])
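As a quick check, the lookup can also be written in the other direction: within each time point, find the row whose id equals my pid and take its x. This is a sketch on the question's data; the column name x_pid is reused from the question, and match(pid, id) is my variation, which I would expect to differ from match(id, pid) only if partnerships were not mutual.

```r
library(dplyr)

paneldata <- data.frame(
  id        = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4),
  time      = rep(1:3, times = 4),
  pid       = c(3, 3, NA, 4, 4, 3, 1, 1, 2, 2, 2, NA),
  x         = c(21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32),
  x_partner = c(27, 28, NA, 30, 31, 29, 21, 22, 26, 24, 25, NA)
)

res <- paneldata %>%
  group_by(time) %>%
  # match(pid, id): position of the partner's row within the same time point;
  # a missing pid yields NA, so x_pid is NA as well
  mutate(x_pid = x[match(pid, id)]) %>%
  ungroup()

# Compare against the target column supplied in the question
all.equal(res$x_pid, res$x_partner)
```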

Categorical Variables Table with Percentages in R

I have a series of categorical variables that have the response options (Favorable, Unfavorable, Neutral).
I want to create a table in R that will give the list of all 10 variables in rows (one variable per row) - with the percentage response "Favorable, Unfavorable, Neutral" in the columns. Is this possible in R? Ideally, I would also want to be able to group this by another categorical variable (e.g. to compare how males vs. females responded to the questions differently).

You'll get better answers if you provide a sample of your actual data. That said, here is a solution using dplyr (and reshape2::melt).
# function to create a column of fake data
make_var <- function(n=100) sample(c("good","bad","ugly"), size=n, replace=TRUE)
# put ten of them together
dat <- as.data.frame(replicate(10, make_var()), stringsAsFactors=FALSE)
library("dplyr")
# then reshape to long format, group, and summarize --
dat %>%
  reshape2::melt(NULL) %>%
  group_by(variable) %>%
  summarize(
    good_pct = (sum(value=="good") / length(value)) * 100,
    bad_pct = (sum(value=="bad") / length(value)) * 100,
    ugly_pct = (sum(value=="ugly") / length(value)) * 100
  )
Note that to group by another column (e.g. sex), you can just say group_by(variable, sex) before you summarize (as long as sex is a column of the data, which isn't the case in this constructed example).

Adapting lefft's example but trying to do everything in the tidyverse (gather and spread come from tidyr):
library(tidyr)
dat %>%
  gather(variable, value) %>%
  group_by(variable) %>%
  count(value) %>%
  mutate(pct = n / sum(n) * 100) %>%
  select(-n) %>%
  spread(value, pct)
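In current tidyr, gather and spread are superseded by pivot_longer and pivot_wider, so the same table could be sketched as follows. The fake data is rebuilt here so the example runs on its own; pct_table is just an illustrative name.

```r
library(dplyr)
library(tidyr)

set.seed(1)
# Fake survey data, as in the answer above: ten questions, three response options
make_var <- function(n = 100) sample(c("good", "bad", "ugly"), size = n, replace = TRUE)
dat <- as.data.frame(replicate(10, make_var()), stringsAsFactors = FALSE)

pct_table <- dat %>%
  # long format: one row per (question, response)
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  count(variable, value) %>%
  group_by(variable) %>%
  mutate(pct = n / sum(n) * 100) %>%
  ungroup() %>%
  select(-n) %>%
  # wide format: one row per question, one column per response option;
  # a response that never occurs for a question gets 0
  pivot_wider(names_from = value, values_from = pct, values_fill = 0)
```

To compare groups (e.g. sex), the grouping column would have to be excluded from pivot_longer and added to both count() and group_by().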

How to extract one specific group in dplyr

Given a grouped tbl, can I extract one/few groups?
Such function can be useful when prototyping code, e.g.:
mtcars %>%
group_by(cyl) %>%
select_first_n_groups(2) %>%
do({'complicated expression'})
Surely, one can do an explicit filter before grouping, but that can be cumbersome.

With a bit of dplyr along with some nesting/unnesting (supported by the tidyr package), you could set up a small helper to get the first (or any) group
first = function(x) x %>% nest %>% ungroup %>% slice(1) %>% unnest(data)
mtcars %>% group_by(cyl) %>% first()
By adjusting the slicing you could also extract the nth group or any range of groups by index, but typically the first or the last is what most users want.
The name is inspired by functional APIs, which all call it first (see the standard libraries of e.g. Kotlin, Python, Scala, Java, Spark).
Edit: Faster Version
A more scalable version (>50x faster on large datasets) that avoids nesting would be
first_group = function(x) x %>%
  select(group_cols()) %>%
  distinct %>%
  ungroup %>%
  slice(1) %>%
  { semi_join(x, .) }
Another positive side effect of this improved version is that it fails if no grouping is present in x.

Try this, where groups is a vector of group numbers. Here 1:2 means the first two groups:
select_groups <- function(data, groups, ...)
  data[sort(unlist(attr(data, "indices")[ groups ])) + 1, ]
mtcars %>% group_by(cyl) %>% select_groups(1:2)
The selected rows appear in the original order. If you prefer that the rows appear in the order that the groups are specified (e.g. in the above example, the rows of the first group followed by the rows of the second group), then remove the sort.
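As far as I can tell, grouped data frames in dplyr 0.8 and later no longer store an "indices" attribute, so the helper above may not work there. A minimal sketch of the same idea for dplyr 1.0 or later, assuming cur_group_id() is available; select_groups is redefined here as a hypothetical replacement, not an existing dplyr function.

```r
library(dplyr)

# Hypothetical helper: keep only the groups whose index is in `groups`.
select_groups <- function(data, groups) {
  # cur_group_id() returns the index of the current group (1, 2, ...);
  # .env$groups makes sure the function argument is used, not a column
  filter(data, cur_group_id() %in% .env$groups)
}

res <- mtcars %>%
  group_by(cyl) %>%
  select_groups(1:2)
```

The result stays grouped and keeps the rows in their original order, so it can be piped straight into the rest of the prototype code.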
