I am new to tidyverse and want to use pipes to create two new variables, one representing the sum of the petal lengths by Species, and one representing the number of instances of each Species, and then to represent that in a new list alongside the Species names.
The following code does the job, but it relies on two intermediate variables:
library(dplyr)
petal_lengths <- iris %>% group_by(Species) %>% summarise(total_petal_length = sum(Petal.Length))
totals_per_species <- iris %>% count(Species, name="Total")
combined_data <- modifyList(petal_lengths,totals_per_species)
My questions are:
Is it possible to do this without creating those two intermediate variables, petal_lengths and totals_per_species, i.e. through a single pipeline rather than two?
If so, is doing this desirable, either abstractly or according to standard conceptions of good tidyverse coding style?
I read here that
The pipe can only transport one object at a time, meaning it’s not so
suited to functions that need multiple inputs or produce multiple
outputs.
which makes me think maybe the answer to my first question is "No", but I'm not sure.
You could achieve your desired result in one pipeline like so:
library(dplyr)
iris %>%
  group_by(Species) %>%
  summarise(total_petal_length = sum(Petal.Length), Total = n())
#> # A tibble: 3 × 3
#> Species total_petal_length Total
#> <fct> <dbl> <int>
#> 1 setosa 73.1 50
#> 2 versicolor 213 50
#> 3 virginica 278. 50
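When the two summaries genuinely have to be computed separately, a join on Species is one way to keep everything in a single pipeline. This is only a minimal sketch, assuming dplyr is attached; the name combined is illustrative:

```r
library(dplyr)

combined <- iris %>%
  group_by(Species) %>%
  summarise(total_petal_length = sum(Petal.Length)) %>%
  # count() builds the second summary inline; the join matches rows on Species
  inner_join(count(iris, Species, name = "Total"), by = "Species")

combined
```

Unlike modifyList, which merges the two data frames as plain lists and so pairs rows purely by position, the join matches rows by key.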
I think Stefan's answer is the correct one for this particular example, and in general you can get the pipe to work with most data manipulation tasks without writing intermediate variables. However, there is perhaps a broader question here.
There are some situations in which the writing of intermediate variables is necessary, and other situations where you have to write more complicated code in the pipe to avoid creating intermediate variables.
I have used a little helper function in some situations to avoid this, which writes a new variable as a side effect. This variable can be re-used within the same pipeline:
branch <- function(.data, newvar, value) {
  newvar <- as.character(as.list(match.call())$newvar)
  assign(newvar, value, parent.frame(2))
  return(.data)
}
You would use it in the pipeline like this:
iris %>%
  branch(totals_per_species, count(., Species, name = "Total")) %>%
  group_by(Species) %>%
  summarise(total_petal_length = sum(Petal.Length)) %>%
  modifyList(totals_per_species)
#> # A tibble: 3 x 3
#> Species total_petal_length Total
#> <fct> <dbl> <int>
#> 1 setosa 73.1 50
#> 2 versicolor 213 50
#> 3 virginica 278. 50
This function works quite well in interactive sessions, but there are probably scoping problems when used in more complex settings. It's certainly not standard coding practice, though I have often wondered whether a more robust version might be a useful addition to the tidyverse.
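A possible alternative that avoids the side effect is to pipe into a braced block, where the magrittr placeholder `.` can be used more than once. This is a sketch rather than a recommendation; it assumes the magrittr pipe `%>%` (the native `|>` pipe has no `.` placeholder), and combined is an illustrative name:

```r
library(dplyr)

combined <- iris %>%
  {
    # inside the braces, `.` is the data piped in, so it can feed both summaries
    inner_join(
      summarise(group_by(., Species), total_petal_length = sum(Petal.Length)),
      count(., Species, name = "Total"),
      by = "Species"
    )
  }

combined
```

Nothing is written to the calling environment here, so the scoping concerns mentioned above should not arise, at the cost of a more nested expression.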
Related
I'm new to plyr and dplyr and seriously don't get it. I have managed my way around some functions, but I struggle with really basic stuff such as the following example.
Taking mtcars, I have different overlapping subsets, such as vs = 1 and am = 1
I now want to run the same analysis, in this case median() for one var over the different subsets, and another analysis, such as mean() for another var.
This should give me in the end the same result, such as the following code - just much shorter:
data_mt <- mtcars # has binary dummy vars for grouping
data_vs <- data_mt[ which(data_mt$vs == 1 ), ]
data_am <- data_mt[ which(data_mt$am == 1 ), ]
median(data_mt$mpg)
median(data_vs$mpg)
median(data_am$mpg)
mean(data_mt$cyl)
mean(data_vs$cyl)
mean(data_am$cyl)
In my real example, I have an analog to data_mt, so if you have a solution starting there, without data_vs etc. that would be great.
I'm sure this is very basic, but I can't wrap my head around it - and as I have some 1500 variables that I want to look at, I'd appreciate your help =)
It may well be that my answer is already out there, but with the terminology I know I didn't find it explained for Dummies ;D
Edit:
To have a better understanding of what I am doing and what I am looking for, I hereby post my original code (not the mtcars example).
I have a dataset ds with 402 observations of 553 variables
The dataset comes from a study with human participants, some of whom opted in for additional research: mys, obs, or both.
ds$mys <- 0
ds$mys[ which(ds$staffmystery_p == "Yes" ) ] <- 1
ds$obs <- 0
ds$obs[ which( !is.na(ds$sales_time)) ] <- 1
The 553 variables are either integers (e.g. for age or years of experience) or factors (e.g. sex or yes/no answers). I now want to compare some descriptives of the full dataset with the descriptives for the subsets and ideally also do a t-test for difference.
Currently I have just a very long list of code that reads more or less like the following (just much longer). This doesn't include t-tests.
describe(ds$age_b)
describe(dm$age_b)
describe(do$age_b)
prop.table(table(ds$sex_b))*100
prop.table(table(dm$sex_b))*100
prop.table(table(do$sex_b))*100
ds, dm and do are different datasets, but they are all just based on the above mentioned full dataset ds and the subsets ds$mys for dm and ds$obs for do
describe comes from the psych package and just lists descriptive statistics like mean or median etc. I don't need all of the metrics, mostly n, mean, median, sd and iqr.
The formula around 'prop.table' gives me a readout I can just copy into the excel tables I use for the final publications. I don't want automated output because I get asked all the time to add or change this, which is really just easier in excel than with automated output. (unless you know a much superior way ;)
Thank you so much!
Here is an option if we want to do this for different columns by group separately
library(dplyr)
library(purrr)
library(stringr)
map_dfc(c('vs', 'am'), ~
  mtcars %>%
    group_by(across(all_of(.x))) %>%
    summarise(!! str_c("Mean_cyl_", .x) := mean(cyl),
              !! str_c("Median_mpg_", .x) := median(mpg), .groups = 'drop')) %>%
  mutate(Mean_cyl_full = mean(mtcars$cyl), Median_mpg_full = median(mtcars$mpg))
-output
# A tibble: 2 x 8
# vs Mean_cyl_vs Median_mpg_vs am Mean_cyl_am Median_mpg_am Mean_cyl_full Median_mpg_full
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 0 7.44 15.6 0 6.95 17.3 6.19 19.2
#2 1 4.57 22.8 1 5.08 22.8 6.19 19.2
If the package version is old, we can replace the across with group_by_at
map_dfc(c('vs', 'am'), ~
  mtcars %>%
    group_by_at(vars(.x)) %>%
    summarise(!! str_c("Mean_cyl_", .x) := mean(cyl),
              !! str_c("Median_mpg_", .x) := median(mpg), .groups = 'drop')) %>%
  mutate(Mean_cyl_full = mean(mtcars$cyl), Median_mpg_full = median(mtcars$mpg))
Update
Based on the update, we could place the datasets in a list, do the transformations at once and return a list of descriptive statistics and the proportion table
out <- map(dplyr::lst(dm, ds, do), ~ {
  dat <- .x %>%
    mutate(mys = as.integer(staffmystery_p == 'Yes'),
           obs = as.integer(!is.na(sales_time)))
  age_b_desc <- describe(dat$age_b)
  prop_table_out <- prop.table(table(dat$sex_b))*100
  return(dplyr::lst(age_b_desc, prop_table_out))
})
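For the mtcars version of the question, where the subsets overlap (vs == 1 and am == 1) and are compared against the full data, another option is to build a named list of subsets and apply one summary to each. This is a rough sketch, assuming dplyr and purrr; the names subsets and summary_tbl are made up for illustration:

```r
library(dplyr)
library(purrr)

# One entry per (possibly overlapping) subset, all derived from the full data
subsets <- list(
  full = mtcars,
  vs   = filter(mtcars, vs == 1),
  am   = filter(mtcars, am == 1)
)

# Apply the same summary to every subset; .id records which subset a row came from
summary_tbl <- map_dfr(
  subsets,
  ~ summarise(.x, n = n(), median_mpg = median(mpg), mean_cyl = mean(cyl)),
  .id = "subset"
)

summary_tbl
```

This gives one row per subset, which is a shape that is easy to copy into a spreadsheet. Adding another statistic means adding one argument to summarise().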
I'm trying to do some Twitter analysis using RStudio and came across a section of a guide that aggregates a few columns: it groups by the person's name and summarises the mean of the user's followers and friends.
The guide shows this:
However, when I run the exact same code in my RStudio, it shows me the following output instead. Why doesn't it show the "screen_name" column, and why does it take the mean of all the rows?
It is possible that plyr was also loaded along with dplyr, and plyr::summarise masked dplyr::summarise:
library(dplyr)
iris %>%
group_by(Species) %>%
plyr::summarise(Sepal.Width = mean(Sepal.Width))
# Sepal.Width
#1 3.057333
An option is to either do this in a fresh R session with only dplyr loaded, or to use dplyr:: explicitly to avoid the masking:
iris %>%
group_by(Species) %>%
dplyr::summarise(Sepal.Width = mean(Sepal.Width), .groups = 'drop')
# A tibble: 3 x 2
# Species Sepal.Width
# <fct> <dbl>
#1 setosa 3.43
#2 versicolor 2.77
#3 virginica 2.97
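To check whether masking is actually the cause, one can ask R which package the unqualified summarise resolves to. This is a small sketch using base R functions only; the variable names are illustrative:

```r
library(dplyr)

# Name of the namespace that the unqualified `summarise` currently resolves to
summarise_source <- environmentName(environment(summarise))

# TRUE if plyr is attached in this session
plyr_attached <- "package:plyr" %in% search()

summarise_source
plyr_attached
```

If summarise_source comes back as "plyr" rather than "dplyr", the grouped summary will collapse to a single row as in the question.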
Assume I have a data frame like the one below (the actual data frame has a million observations). I am trying to look for the correlation between the signal column and the other net return columns, grouped by the values of the Signal_Up column.
I have tried “dplyr” library and combination of functions “group_by” and “summarize”. However, I am only able to get correlation between two columns and not the multiple columns.
library(dplyr)
df %>%
group_by(Signal_Up) %>%
summarize (COR=cor(signal, Net_return_at_t_plus1))
Data and desired result are given below.
Data
Desired Result
Correlation between "signal" Vs ["Net_return_at_t_plus1", "Net_return_at_t_plus5", "Net_return_at_t_plus10"]
Group by "Signal_Up"
Maybe you can try to use summarise_at to perform the correlation over several columns.
Here, I took the iris dataset as example:
library(dplyr)
iris %>% group_by(Species) %>%
summarise_at(vars(Sepal.Length:Petal.Length), ~cor(Petal.Width,.))
# A tibble: 3 x 4
Species Sepal.Length Sepal.Width Petal.Length
<fct> <dbl> <dbl> <dbl>
1 setosa 0.278 0.233 0.332
2 versicolor 0.546 0.664 0.787
3 virginica 0.281 0.538 0.322
For your dataset, you should try something like:
library(dplyr)
df %>% group_by(Signal_Up) %>%
  summarise_at(vars(Net_return_at_t_plus1, Net_return_at_t_plus5, Net_return_at_t_plus10), ~cor(signal,.))
Does it answer your question?
NB: It is easier for people to help solve your issue if you provide a reproducible example that they can easily copy and paste, instead of adding it as an image (see: How to make a great R reproducible example)
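In dplyr 1.0.0 and later, summarise_at is superseded by across(), so the same idea can also be written as below. This is a sketch on iris, as in the example above; cors is an illustrative name:

```r
library(dplyr)

cors <- iris %>%
  group_by(Species) %>%
  # correlate Petal.Width with each selected column, within each Species
  summarise(across(Sepal.Length:Petal.Length, ~ cor(Petal.Width, .x)),
            .groups = "drop")

cors
```

For the data in the question, the selection inside across() would presumably be the three Net_return_at_t_plus columns and the fixed column would be signal.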
I have a large data set with over 2000 observations. The data involves toxin concentrations in animal tissue. My response variable is myRESULT and I have multiple observations per ANALYTE of interest. I need to remove the outliers, as defined by numbers more than three SD away from the mean, from within each ANALYTE group.
While I realize that I should not remove outliers from a dataset normally, I would still like to know how to do it in R.
Here is a small portion of what my data look like:
This is subsetting by group, which can be done in different ways. With dplyr, you use group_by to set the grouping, then filter to subset rows, passing it an expression that returns TRUE for rows to keep and FALSE for outliers.
For example, using iris and 2 standard deviations (everything is within 3):
library(dplyr)
iris_clean <- iris %>%
group_by(Species) %>%
filter(abs(Petal.Length - mean(Petal.Length)) < 2*sd(Petal.Length))
iris_clean %>% count()
#> # A tibble: 3 x 2
#> # Groups: Species [3]
#> Species n
#> <fct> <int>
#> 1 setosa 46
#> 2 versicolor 47
#> 3 virginica 47
With a split-apply-combine approach in base R,
do.call(rbind, lapply(
split(iris, iris$Species),
function(x) x[abs(x$Petal.Length - mean(x$Petal.Length)) < 2*sd(x$Petal.Length), ]
))
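Applied to the column names from the question (ANALYTE and myRESULT) with the three-SD rule, the dplyr version might look like the sketch below. The tox data frame and its values are invented purely for illustration, with one obvious outlier planted in each group:

```r
library(dplyr)

set.seed(1)
# Made-up toy data: two analytes, 50 values each, one planted outlier per group
tox <- data.frame(
  ANALYTE  = rep(c("A", "B"), each = 50),
  myRESULT = c(rnorm(49), 100, rnorm(49), -100)
)

tox_clean <- tox %>%
  group_by(ANALYTE) %>%
  # keep rows within three standard deviations of their own group's mean
  filter(abs(myRESULT - mean(myRESULT)) <= 3 * sd(myRESULT)) %>%
  ungroup()

nrow(tox)
nrow(tox_clean)
```

Note that the mean and SD are computed with the outliers still included, so a single extreme value inflates the threshold for its whole group.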
I have a data frame in R that lists monthly sales data by department for a store. Each record contains a month/year, a department name, and the total sales in that department for the month. I'm trying to calculate the mean sales by department, adding them to the vector avgs, but I seem to be having two problems: the total sales per department is not compiling at all (it's evaluating to zero) and avgs is compiling by record instead of by department. Here's what I have:
avgs = c()
for(dept in data$departmentName){
total <- 0
for(record in data){
if(identical(data$departmentName, dept)){
total <- total + data$ownerSales[record]
}
}
avgs <- c(avgs, total/72)
}
Upon looking at avgs on completion of the loop, I find that it's returning a vector of zeroes the length of the data frame rather than a vector of 22 averages (there are 22 departments). I've been tweaking this forever and I'm sure it's a stupid mistake, but I can't figure out what it is. Any help would be appreciated.
Why not use library(dplyr)?
library(dplyr)
data(iris)
iris %>% group_by(Species) %>% # or dept
summarise(total_plength = sum(Petal.Length), # total owner sales
weird_divby72 = total_plength/72) # total/72?
# A tibble: 3 × 3
Species total_plength weird_divby72
<fctr> <dbl> <dbl>
1 setosa 73.1 1.015278
2 versicolor 213.0 2.958333
3 virginica 277.6 3.855556
Your case would probably look like this:
data %>% group_by(departmentName) %>%
  summarise(total_sales = sum(ownerSales),
            monthly_sales = total_sales/72)
I like dplyr for its syntax and pipeability. I think it is a huge improvement over base R for ease of data wrangling. Here is a good cheat sheet to help you get rolling: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
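Since the question asks for the mean per department, mean() can also be called directly instead of dividing a total by a hard-coded 72. This is a small sketch on made-up data: the sales data frame and its values are invented for illustration, and only the column names departmentName and ownerSales come from the question.

```r
library(dplyr)

# Invented toy data standing in for the real sales records
sales <- data.frame(
  departmentName = rep(c("Bakery", "Deli"), each = 3),
  ownerSales     = c(10, 20, 30, 5, 5, 20)
)

# dplyr: one row per department
avgs <- sales %>%
  group_by(departmentName) %>%
  summarise(mean_sales = mean(ownerSales))

# base R equivalent: a named vector of means, one element per department
avgs_vec <- tapply(sales$ownerSales, sales$departmentName, mean)

avgs
avgs_vec
```

As for why the original loop returns zeroes: for (record in data) iterates over the columns of the data frame rather than its rows, and identical(data$departmentName, dept) compares the whole column with a single value, so the condition is never true and total stays at 0. The outer loop runs once per record, which is why avgs ends up as long as the data frame.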