Fatal error with min_rank inside mutate with a data frame - r

I am trying to practice window functions and just want to experiment, but my R session aborts when I run this code:
library(dplyr)
mtcars %>%
select(mpg, cyl) %>%
mutate(r = min_rank())
If these functions behave differently depending on whether the context is a data frame or a vector, how should we know what they do? All the examples are for vectors. E.g. row_number() behaves differently on a data frame compared to a vector.
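The call in the question passes no argument to min_rank(), which ranks a vector that it is given. As a minimal sketch, assuming the intent was to rank the rows by mpg, the column can be passed explicitly. row_number() is included for comparison because it can be called without arguments inside mutate(). The names ranked, r and rn are only illustrative.

```r
library(dplyr)

ranked <- mtcars %>%
  select(mpg, cyl) %>%
  mutate(
    r  = min_rank(mpg),   # rank of each mpg value; ties share the smallest rank
    rn = row_number()     # position of each row in the data frame
  )

head(ranked)
```

Inside mutate(), the column name mpg refers to the whole column as a vector, so the vector examples in the documentation apply directly.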

Related

Is it possible to use group_by in a function for more than one variable?

I created a function that aggregates the numeric values in a dataset, and I use a group_by() function to group the data first. Below is an example of what the code I wrote looks like. Is there a way I can group_by() more than one variable without having to create another input for the function?
agg <- function(data, group){
aggdata <- data %>%
group_by({{group}}) %>%
select_if(function(col) !is.numeric(col) & !is.integer(col)) %>%
summarise_if(is.numeric, sum, na.rm = TRUE)
return(aggdata)
Your code has (at least) a misplaced curly brace, and it's a bit difficult to see what you're trying to accomplish without a reproducible example and desired result.
It is possible to pass a vector of variable names to group_by(). For example, the following produces the same result as mtcars %>% group_by(cyl, gear):
my_groups <- c("cyl", "gear")
mtcars %>% group_by(!!!syms(my_groups))
Maybe you could use this syntax within your function definition.
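As one possible way to apply this inside a function, the sketch below forwards any number of grouping variables through `...` instead of a single `group` argument. The function name agg_by is hypothetical, and the use of across(where(is.numeric), ...) is an assumption about what the aggregation should do (sum every numeric non-grouping column); it requires dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical helper: group by whatever columns are passed in `...`,
# then sum every remaining numeric column.
agg_by <- function(data, ...) {
  data %>%
    group_by(...) %>%
    summarise(
      across(where(is.numeric), ~ sum(.x, na.rm = TRUE)),
      .groups = "drop"
    )
}

res <- agg_by(mtcars, cyl, gear)
res
```

Calling agg_by(mtcars, cyl) groups by a single variable with the same function, so no extra input is needed for additional grouping columns.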

Using the R sequence operator ":" within the sum command with more than 50 columns

I would like to index by column name within the sum command using the sequence operator.
library(dplyr)
library(tidyverse)
df=data.frame(
X=c("A","B","C"),
X.1=c(1,2,3),X.2=c(1,2,3),X.3=c(1,2,3),X.4=c(1,2,3),X.5=c(1,2,3),X.6=c(1,2,3),X.7=c(1,2,3),X.8=c(1,2,3),X.9=c(1,2,3),X.10=c(1,2,3),
X.11=c(1,2,3),X.12=c(1,2,3),X.13=c(1,2,3),X.14=c(1,2,3),X.15=c(1,2,3),X.16=c(1,2,3),X.17=c(1,2,3),X.18=c(1,2,3),X.19=c(1,2,3),X.20=c(1,2,3),
X.21=c(1,2,3),X.22=c(1,2,3),X.23=c(1,2,3),X.24=c(1,2,3),X.25=c(1,2,3),X.26=c(1,2,3),X.27=c(1,2,3),X.28=c(1,2,3),X.29=c(1,2,3),X.30=c(1,2,3),
X.31=c(1,2,3),X.32=c(1,2,3),X.33=c(1,2,3),X.34=c(1,2,3),X.35=c(1,2,3),X.36=c(1,2,3),X.37=c(1,2,3),X.38=c(1,2,3),X.39=c(1,2,3),X.40=c(1,2,3),
X.41=c(1,2,3),X.42=c(1,2,3),X.43=c(1,2,3),X.44=c(1,2,3),X.45=c(1,2,3),X.46=c(1,2,3),X.47=c(1,2,3),X.48=c(1,2,3),X.49=c(1,2,3),X.50=c(1,2,3),
X.51=c(1,2,3),X.52=c(1,2,3),X.53=c(1,2,3),X.54=c(1,2,3),X.55=c(1,2,3),X.56=c(1,2,3))
Is there a quicker way to do this? The following gives the correct result. However, for large datasets (larger than this one) it becomes very laborious, especially when pivot_wider is used and the columns are not created beforehand (as above).
df %>% rowwise() %>% mutate(
Result_column=case_when(
X=="A"~ sum(c(X.1,X.2,X.3,X.4,X.5)),
X=="B"~ sum(c(X.4,X.5)),
X=="C" ~ sum(c( X.3, X.4, X.5, X.6, X.7, X.8, X.9, X.10, X.11, X.12, X.13, X.14, X.15, X.16,
X.17, X.18, X.19, X.20, X.21, X.22, X.23, X.24, X.25, X.26, X.27, X.28, X.29, X.30,
X.31, X.32, X.33, X.34, X.35, X.36, X.37, X.38, X.39, X.40, X.41, X.42,X.43, X.44,
X.45, X.46, X.47, X.48, X.49, X.50, X.51, X.52, X.53, X.54, X.55, X.56)))) %>% dplyr::select(Result_column)
The following shows how it would look with "select"-style syntax, which is what I would like to use. However, it does not give the correct numerical result. The sequence operator ":" would shorten the code by about 50 entries.
df %>% rowwise() %>% mutate(
Result_column=case_when(
X=="A"~ sum(c(X.1:X.5)),
X=="B"~ sum(c(X.4:X.5)),
X=="C" ~ sum(c(X.3:X.56)))) %>% dplyr::select(Result_column)
Below is a related question; however, it is not the same, because what is needed is not every column that starts with "X" but a sequence of columns.
Using mutate rowwise over a subset of columns
EDIT:
the provided code (below) from cnbrowlie is correct.
df %>% mutate(
Result_column=case_when(
X=="A"~ sum(c(X.1:X.5)),
X=="B"~ sum(c(X.4:X.5)),
X=="C" ~ sum(c(X.3:X.56)))) %>% dplyr::select(Result_column)
This can be done with dplyr>=1.0.0 using rowSums() (which computes the sum for a row across multiple columns) and across() (which superseded vars() as a method for specifying columns in a dataframe, allowing the use of : to select sequences of columns):
df %>% rowwise() %>% mutate(
Result_column=case_when(
X=="A"~ rowSums(across(X.1:X.5)),
X=="B"~ rowSums(across(X.4:X.5)),
X=="C" ~ rowSums(across(X.3:X.56))
)
) %>% dplyr::select(Result_column)
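The reason sum(c(X.1:X.5)) gives the wrong number is that, inside rowwise(), X.1:X.5 is evaluated on the values in that row, so it builds an integer sequence from the value of X.1 to the value of X.5 instead of selecting columns. As an alternative sketch that keeps the sum() form from the question, c_across() (also dplyr>=1.0.0) accepts the same column-range selection inside rowwise(). The data frame below is rebuilt programmatically only to keep the example short; it is intended to match the one in the question.

```r
library(dplyr)

# Rebuild the question's data: X plus X.1 ... X.56, each numeric column holding 1, 2, 3
vals <- as.data.frame(matrix(rep(c(1, 2, 3), times = 56), nrow = 3))
names(vals) <- paste0("X.", 1:56)
df <- cbind(data.frame(X = c("A", "B", "C")), vals)

res <- df %>%
  rowwise() %>%
  mutate(
    Result_column = case_when(
      X == "A" ~ sum(c_across(X.1:X.5)),   # columns X.1 through X.5
      X == "B" ~ sum(c_across(X.4:X.5)),   # columns X.4 and X.5
      X == "C" ~ sum(c_across(X.3:X.56))   # columns X.3 through X.56
    )
  ) %>%
  ungroup() %>%
  select(Result_column)

res
```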

When using rstatix::get_summary_stats how do I export the results to a .csv file?

I am trying to calculate group means from a 4-way mixed ANOVA. The code is working great -- and I can output the results into my console -- but how do I save this table to a file (i.e. export as a .csv)? I tried using the capture.output() command, but then it cannot find the object score.
##Group Means
data %>%
group_by(region, task, production, position) %>%
get_summary_stats(score, type = "mean_sd")
I think you're looking for readr::write_csv() (or write.csv() from base R):
library(dplyr)
library(rstatix)
library(readr)
mtcars %>%
group_by(cyl) %>%
get_summary_stats(hp, type="mean_sd") %>%
write_csv("my_output.csv")
You might also be interested in the emmeans package, for getting "expected marginal means" from more complex statistical models.
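If the table is needed both in the session and on disk, one option is to assign it to an object first and then write that object out, which avoids capture.output() altogether. The sketch below uses base write.csv() and, so that it runs without rstatix installed, a plain summarise() as a stand-in for get_summary_stats(); the object name group_means and the temporary file path are assumptions for illustration.

```r
library(dplyr)

# Stand-in for get_summary_stats(hp, type = "mean_sd"): n, mean and sd per group
group_means <- mtcars %>%
  group_by(cyl) %>%
  summarise(n = n(), mean = mean(hp), sd = sd(hp), .groups = "drop")

# Write the stored table to a .csv file (a temporary location here)
out_file <- file.path(tempdir(), "group_means.csv")
write.csv(group_means, out_file, row.names = FALSE)

# Read it back to confirm what was saved
check <- read.csv(out_file)
check
```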

How can I write this R expression in the pipe operator format?

I am trying to rewrite this expression to magrittr’s pipe operator:
print(mean(pull(df, height), na.rm=TRUE))
which returns 175.4 for my dataset.
I know that I have to start with the data frame and write it as df %>% but I’m confused about how to write it inside out. For example, should the na.rm=TRUE go inside mean(), pull() or print()?
UPDATE: I actually figured it out by trial and error...
df %>%
pull(height) %>%
mean(na.rm=TRUE) %>%
print()
returns 175.4
It would be good practice to make a reproducible example, with dummy data like this:
height <- 1:30
weight <- 1:30
df <- data.frame(height, weight)
The pipe operator works with most of the tidyverse, not just magrittr. The pull() function you are using comes from dplyr. The na.rm=TRUE argument is needed for many summary functions such as mean and sd, as well as functions such as min and max, because by default they return NA when the input contains NA values.
df %>% pull(height) %>% mean(na.rm=TRUE) %>% print()
Unless your data is nested, you may not even need to use pull:
df %>% summarise(mean = mean(height,na.rm=TRUE))
Also, with summarise you can store the results in another data frame rather than just printing them, and pull them out of that data frame whenever you want.
height_summary <- df %>% summarise(meanHt = mean(height,na.rm=TRUE), sdHt = sd(height,na.rm=TRUE))
height_summary[1]
height_summary[2]
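To see why na.rm belongs inside mean(), here is a small sketch with made-up data that contains one missing height. The values are illustrative only and are not the asker's dataset.

```r
library(dplyr)

# Dummy data with a missing height (illustrative values)
df <- data.frame(height = c(170, NA, 180), weight = c(60, 70, 80))

# Without na.rm the missing value propagates and the result is NA
mean_default <- df %>% pull(height) %>% mean()

# na.rm = TRUE is an argument of mean(), so it goes in that step of the pipe
mean_no_na <- df %>% pull(height) %>% mean(na.rm = TRUE)

print(mean_default)
print(mean_no_na)
```

Each step of the pipe receives the previous result as its first argument, so any extra argument stays with the function it belonged to in the nested version.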

How can you obtain the group_by value for use in passing to a function?

I am trying to use dplyr to apply a function to a data frame that is grouped using the group_by function. I am applying a function to each row of the grouped data using do(). I would like to obtain the value of the group_by variable so that I might use it in a function call.
So, effectively, I have-
tmp <-
my_data %>%
group_by(my_grouping_variable) %>%
do(my_function_call(data.frame(x = .$X, y = .$Y),
GROUP_BY_VARIABLE))
I'm sure that I could call unique and get it...
do(my_function_call(data.frame(x = .$X, y = .$Y),
unique(.$my_grouping_variable)))
But, it seems clunky and would inefficiently call unique for every grouping value.
Is there a way to get the value of the group_by variable in dplyr?
I'm going to prematurely say sorry if this is a crazy easy thing to answer. I promise that I've exhaustively searched for an answer.
First, if necessary, check if it's a grouped data frame: inherits(data, "grouped_df").
If you want the subsets of data frames, you could nest the groups:
mtcars %>% group_by(cyl) %>% nest()
Usually, you won't nest within the pipe-chain, but check in your function:
your_function <- function(x) {
# nest a grouped data frame into one row per group
if(inherits(x, "grouped_df")) x <- nest(x)
x
}
Your function should then iterate over the list-column data with all grouped subsets. If you use a function within mutate, e.g.
mtcars %>% group_by(cyl) %>% mutate(abc = your_function_call(.x))
then note that your function directly receives the values for each group, passed as class structure. It's a bit difficult to explain, just try it out and debug your_function_call step by step...
You can use groups(); however, a standard-evaluation (SE) version of this does not exist, so I'm unsure of its use in programming.
library(dplyr)
df <- mtcars %>% group_by(cyl, mpg)
groups(df)
[[1]]
cyl
[[2]]
mpg
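For passing the current group's value into a function, one option in more recent dplyr versions is group_modify(), which calls a function once per group with the group's rows as `.x` and a one-row data frame of the grouping values as `.y`. The sketch below is only an illustration: describe_group is a hypothetical stand-in for my_function_call. group_vars() is also shown, since it returns the grouping column names as a character vector.

```r
library(dplyr)

# Hypothetical per-group function: receives the group's rows and its key
describe_group <- function(data, key) {
  data.frame(
    n_rows = nrow(data),
    label  = paste0("cyl=", key$cyl)   # key$cyl is the group_by value for this group
  )
}

res <- mtcars %>%
  group_by(cyl) %>%
  group_modify(~ describe_group(.x, .y)) %>%
  ungroup()

res

# Grouping column names as plain strings
grp_names <- group_vars(mtcars %>% group_by(cyl, mpg))
grp_names
```

Because the key is supplied once per group, there is no need to call unique() on the grouping column inside the function.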
