dplyr group_by not working as expected when calculating error rates - R

I am calculating error rates for two different forecasting methods.
My basic approach is to group by nk, calculate the error rate for each method, and choose the one with the lower value. The issue is that MAP1E_arima_ds and MAPE_cagr_ds come out as the same value for every group; somehow group_by seems to have no effect on the calculation.
Here is what I tried:
temp2 %>%
  group_by(nk) %>%
  mutate(MAP1E_arima_ds = sum(temp2$ABS_arima_error_ds) / nrow(temp2)) %>%
  mutate(MAPE_cagr_ds = sum(temp2$ABS_cagr_error_ds) / nrow(temp2))
The expected result is something like this:
nk       MAP1E_arima_ds  MAPE_cagr_ds
1-G0175  value_x         value_y
1-H0182  value_z         value_a
so that I can compare the error rates and choose the forecasting method with the lower one.

If I understand you correctly, I think what you are looking for is this:
library(dplyr)

df %>%
  group_by(nk) %>%
  summarise(MAP1E_arima_ds = sum(ABS_arima_error_ds) / n(),
            MAPE_cagr_ds = sum(ABS_cagr_error_ds) / n())
# A tibble: 2 x 3
#  nk      MAP1E_arima_ds MAPE_cagr_ds
#  <chr>            <dbl>        <dbl>
#1 1-G0175          14.7          3.38
#2 1-H0182           2.91         7.40
which is effectively just the mean:
df %>%
  group_by(nk) %>%
  summarise(MAP1E_arima_ds = mean(ABS_arima_error_ds),
            MAPE_cagr_ds = mean(ABS_cagr_error_ds))
Moreover, after copying your dput output it seems that your data is already grouped by nk, so the following would give the same result:
df %>%
  summarise(MAP1E_arima_ds = mean(ABS_arima_error_ds),
            MAPE_cagr_ds = mean(ABS_cagr_error_ds))
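If the goal is then to pick the method with the lower error for each nk, a small follow-up step along these lines should work (a sketch; best_method is a column name I am introducing here, not from the original post):
library(dplyr)

df %>%
  group_by(nk) %>%
  summarise(MAP1E_arima_ds = mean(ABS_arima_error_ds),
            MAPE_cagr_ds = mean(ABS_cagr_error_ds)) %>%
  # Label each nk with whichever method produced the smaller error
  mutate(best_method = if_else(MAP1E_arima_ds < MAPE_cagr_ds, "arima", "cagr"))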

Related

Summarizing by group using dplyr not working as expected

I am trying to summarize the demographic information of a dataframe and I am running into some issues. Breaking it down by gender, there are 4 possible options that participants can choose from: 1, 2, 3, 4, with blanks (no response) treated as NA values by R. I am getting the correct counts for each gender, but I run into trouble when trying to obtain the mean age for each gender.
I'd like to keep the observations with NA values because, while they may not have answered the demographic questions, they have answered other questions, so I do not want to simply remove those rows from the dataframe.
Here is what I tried:
# df$Q10: what is your gender
by_gender <- df %>%
  group_by(df$Q10) %>%
  dplyr::summarize(count = n(),
                   AvgAge = mean(df$Q11_1_TEXT, na.rm = TRUE))
by_gender
This returns the same value for all genders, namely
mean(df$Q11_1_TEXT, na.rm = TRUE)
Both the gender and age columns have NA values, and I suspect this is where the issue lies. I tried adding na.rm = TRUE, but that does not seem to help. What else can I try?
Edit: Removing $ makes the function work as expected.
When you ask for mean(df$Q11_1_TEXT) it will calculate a mean from the original ungrouped vector, whereas if you use mean(Q11_1_TEXT) it will look for Q11_1_TEXT within the grouped data frame it received from the prior step.
Compare:
mtcars %>%
  group_by(gear) %>%
  summarize(wt_ttl = sum(wt),
            wt_ttl2 = sum(mtcars$wt))
# A tibble: 3 × 3
   gear wt_ttl wt_ttl2
  <dbl>  <dbl>   <dbl>
1     3   58.4    103.
2     4   31.4    103.
3     5   13.2    103.
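Applied to the original code, that means dropping the df$ inside the pipeline (a sketch reusing the question's column names Q10 and Q11_1_TEXT):
by_gender <- df %>%
  group_by(Q10) %>%
  dplyr::summarize(count = n(),
                   AvgAge = mean(Q11_1_TEXT, na.rm = TRUE))
by_gender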

R as_tsibble() running out of memory

I have a dataset with 3.9M rows and 5 columns, stored as a tibble. When I try to convert it to a tsibble, I run out of memory, even though I have 32 GB, which should be more than enough. The weird thing is that if I apply a filter before piping into as_tsibble(), it works, even though I'm not actually filtering out any rows.
This does not work:
dataset %>% as_tsibble(index = TimeStamp, key = c("TSSU", "Phase"))
This works, even though there are no Phase values less than 1, so the filter does nothing and no rows are actually removed:
dataset %>% filter(Phase > 0) %>% as_tsibble(index = TimeStamp, key = c("TSSU", "Phase"))
Any ideas why the second option works? Here's what the dataset looks like:
Volume <dbl>  Travel_Time <dbl>  TSSU <chr>  Phase <int>  TimeStamp <dttm>
105           1.23               01017       2            2020-09-28 10:00:00
20            1.11               01017       2            2020-09-28 10:15:00
Have you tried the data.table library? It is optimized for performance with large datasets. I have replicated your steps, and depending on where the dataset variable comes from, you may also want to use the fread function to load the data, as it is very fast too.
library(data.table)

dataset <- as.data.table(dataset)
# setkeyv(x = dataset, cols = c("TSSU", "Phase")) # This line may not be needed
dataset[Phase > 0, ]
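If dataset is read from disk, fread is typically much faster and lighter on memory than read.csv (a sketch; the file name and column classes here are assumptions, not from the original post):
library(data.table)

# Hypothetical file name; adjust colClasses to the real schema
dataset <- fread("travel_times.csv",
                 colClasses = list(character = "TSSU", integer = "Phase"))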

plyr for the same analysis across different subsets

I'm new to plyr and dplyr and seriously don't get it. I have managed to work my way around some functions, but I struggle with really basic stuff such as the following example.
Taking mtcars, I have different overlapping subsets, such as vs == 1 and am == 1.
I now want to run the same analysis, in this case median(), for one variable over the different subsets, and another analysis, such as mean(), for another variable.
In the end this should give me the same result as the following code - just much shorter:
data_mt <- mtcars # has binary dummy vars for grouping
data_vs <- data_mt[which(data_mt$vs == 1), ]
data_am <- data_mt[which(data_mt$am == 1), ]

median(data_mt$mpg)
median(data_vs$mpg)
median(data_am$mpg)

mean(data_mt$cyl)
mean(data_vs$cyl)
mean(data_am$cyl)
In my real example, I have an analog to data_mt, so if you have a solution starting there, without data_vs etc., that would be great.
I'm sure this is very basic, but I can't wrap my head around it - and as I have some 1500 variables that I want to look at, I'd appreciate your help =)
It may well be that my answer is already out there, but with the terminology I know, I didn't find it explained for dummies ;D
Edit:
To give a better understanding of what I am doing and what I am looking for, I hereby post my original code (not the mtcars example).
I have a dataset ds with 402 observations of 553 variables.
The dataset comes from a study with human participants, some of whom opted in for additional research, mys or obs or both.
ds$mys <- 0
ds$mys[which(ds$staffmystery_p == "Yes")] <- 1
ds$obs <- 0
ds$obs[which(!is.na(ds$sales_time))] <- 1
The 553 variables are either integers (e.g. for age or years of experience) or factors (e.g. sex or yes/no answers). I now want to compare some descriptives of the full dataset with the descriptives for the subsets, and ideally also do a t-test for differences.
Currently I just have a very long list of code that reads more or less like the following (just much longer). This doesn't include t-tests.
describe(ds$age_b)
describe(dm$age_b)
describe(do$age_b)
prop.table(table(ds$sex_b))*100
prop.table(table(dm$sex_b))*100
prop.table(table(do$sex_b))*100
ds, dm and do are different datasets, but they are all based on the full dataset ds mentioned above, with the subsets defined by ds$mys for dm and ds$obs for do.
describe comes from the psych package and just lists descriptive statistics like mean or median. I don't need all of the metrics, mostly n, mean, median, sd and IQR.
The prop.table expression gives me a readout I can just copy into the Excel tables I use for the final publications. I don't want automated output because I get asked all the time to add or change things, which is really just easier in Excel than with automated output (unless you know a much superior way ;)
Thank you so much!
Here is an option if we want to do this for different columns by group separately:
library(dplyr)
library(purrr)
library(stringr)

map_dfc(c('vs', 'am'), ~
    mtcars %>%
      group_by(across(all_of(.x))) %>%
      summarise(!! str_c("Mean_cyl_", .x) := mean(cyl),
                !! str_c("Median_mpg_", .x) := median(mpg),
                .groups = 'drop')) %>%
  mutate(Mean_cyl_full = mean(mtcars$cyl),
         Median_mpg_full = median(mtcars$mpg))
Output:
# A tibble: 2 x 8
#     vs Mean_cyl_vs Median_mpg_vs    am Mean_cyl_am Median_mpg_am Mean_cyl_full Median_mpg_full
#  <dbl>       <dbl>         <dbl> <dbl>       <dbl>         <dbl>         <dbl>           <dbl>
#1     0        7.44          15.6     0        6.95          17.3          6.19            19.2
#2     1        4.57          22.8     1        5.08          22.8          6.19            19.2
If the package version is old, we can replace the across with group_by_at:
map_dfc(c('vs', 'am'), ~
    mtcars %>%
      group_by_at(vars(.x)) %>%
      summarise(!! str_c("Mean_cyl_", .x) := mean(cyl),
                !! str_c("Median_mpg_", .x) := median(mpg),
                .groups = 'drop')) %>%
  mutate(Mean_cyl_full = mean(mtcars$cyl),
         Median_mpg_full = median(mtcars$mpg))
Update
Based on the update, we could place the datasets in a list, do the transformations at once, and return a list with the descriptive statistics and the proportion table:
library(psych) # for describe()

out <- map(dplyr::lst(dm, ds, do), ~ {
  dat <- .x %>%
    mutate(mys = as.integer(staffmystery_p == 'Yes'),
           obs = as.integer(!is.na(sales_time)))
  age_b_desc <- describe(dat$age_b)
  prop_table_out <- prop.table(table(dat$sex_b)) * 100
  dplyr::lst(age_b_desc, prop_table_out)
})
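The question also mentions t-tests for differences between the full sample and the opt-in subsets. A minimal sketch, assuming age_b is numeric and sex_b has two levels, is to test within the full dataset using the dummy variables as grouping factors:
# Compare mean age between the mys opt-ins and the rest of the sample
t.test(age_b ~ mys, data = ds)

# For a two-level factor such as sex_b, a proportion test is the analogue
prop.test(table(ds$mys, ds$sex_b))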

Correlation between multiple variables of a data frame, grouped by a different variable

Assume I have a data frame like the one below (the actual data frame has a million observations). I am trying to find the correlation between the signal column and the other net-return columns, grouped by the various values of the Signal_Up column.
I have tried the dplyr library and a combination of the group_by and summarize functions. However, I am only able to get the correlation between two columns, not multiple columns.
library(dplyr)

df %>%
  group_by(Signal_Up) %>%
  summarize(COR = cor(signal, Net_return_at_t_plus1))
The data and desired result were given as images in the original post. The desired result is the correlation between signal and each of Net_return_at_t_plus1, Net_return_at_t_plus5 and Net_return_at_t_plus10, grouped by Signal_Up.
Maybe you can try to use summarise_at to perform the correlation over several columns.
Here, I took the iris dataset as example:
library(dplyr)

iris %>%
  group_by(Species) %>%
  summarise_at(vars(Sepal.Length:Petal.Length), ~ cor(Petal.Width, .))
# A tibble: 3 x 4
  Species    Sepal.Length Sepal.Width Petal.Length
  <fct>             <dbl>       <dbl>        <dbl>
1 setosa            0.278       0.233        0.332
2 versicolor        0.546       0.664        0.787
3 virginica         0.281       0.538        0.322
For your dataset, you should try something like this (note the range runs from Net_return_at_t_plus1 through Net_return_at_t_plus10, assuming those columns are adjacent):
library(dplyr)

df %>%
  group_by(Signal_Up) %>%
  summarise_at(vars(Net_return_at_t_plus1:Net_return_at_t_plus10), ~ cor(signal, .))
Does that answer your question?
NB: It is easier for people to solve your issue if you provide a reproducible example that they can easily copy/paste, instead of adding it as an image (see: How to make a great R reproducible example).
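As an aside, summarise_at is superseded in newer dplyr versions; the same idea with across() would look like this (a sketch using the column names from the question):
library(dplyr)

df %>%
  group_by(Signal_Up) %>%
  summarise(across(c(Net_return_at_t_plus1, Net_return_at_t_plus5, Net_return_at_t_plus10),
                   ~ cor(signal, .x),
                   .names = "COR_{.col}")) # one correlation column per return horizon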

For loop compilation error in R

I have a data frame in R that lists monthly sales data by department for a store. Each record contains a month/year, a department name, and the total sales in that department for the month. I'm trying to calculate the mean sales by department, adding them to the vector avgs, but I seem to be having two problems: the total sales per department is not accumulating at all (it evaluates to zero), and avgs grows by record instead of by department. Here's what I have:
avgs <- c()
for (dept in data$departmentName) {
  total <- 0
  for (record in data) {
    if (identical(data$departmentName, dept)) {
      total <- total + data$ownerSales[record]
    }
  }
  avgs <- c(avgs, total / 72)
}
Upon looking at avgs on completion of the loop, I find that it's a vector of zeroes the length of the data frame rather than a vector of 22 averages (there are 22 departments). I've been tweaking this forever and I'm sure it's a stupid mistake, but I can't figure out what it is. Any help would be appreciated.
Why not use dplyr?
library(dplyr)
data(iris)

iris %>%
  group_by(Species) %>% # or dept
  summarise(total_plength = sum(Petal.Length), # total owner sales
            weird_divby72 = total_plength / 72) # total/72?
# A tibble: 3 × 3
     Species total_plength weird_divby72
      <fctr>         <dbl>         <dbl>
1     setosa          73.1      1.015278
2 versicolor         213.0      2.958333
3  virginica         277.6      3.855556
Your case would probably look like this:
data %>%
  group_by(deptName) %>%
  summarise(total_sales = sum(ownerSales),
            monthly_sales = total_sales / 72)
I like dplyr for its syntax and pipeability. I think it is a huge improvement over base R for ease of data wrangling. Here is a good cheat sheet to help you get rolling: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
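For completeness, the original loop can also be fixed in base R: iterate over the unique department names and subset with == instead of comparing the whole vector with identical() (a sketch built from the question's own column names):
avgs <- c()
for (dept in unique(data$departmentName)) {
  # Sum sales over the rows belonging to this department
  total <- sum(data$ownerSales[data$departmentName == dept])
  avgs <- c(avgs, total / 72) # 72 months of data, per the original code
}
names(avgs) <- unique(data$departmentName)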
