I have a data frame in R that lists monthly sales data by department for a store. Each record contains a month/year, a department name, and the total sales in that department for the month. I'm trying to calculate the mean sales by department and add each mean to the vector avgs, but I seem to be having two problems: the total sales per department is not being computed at all (it's evaluating to zero), and avgs is being built per record instead of per department. Here's what I have:
avgs = c()
for(dept in data$departmentName){
total <- 0
for(record in data){
if(identical(data$departmentName, dept)){
total <- total + data$ownerSales[record]
}
}
avgs <- c(avgs, total/72)
}
Upon looking at avgs on completion of the loop, I find that it's returning a vector of zeroes the length of the data frame rather than a vector of 22 averages (there are 22 departments). I've been tweaking this forever and I'm sure it's a stupid mistake, but I can't figure out what it is. Any help would be appreciated.
Why not use library(dplyr)?
library(dplyr)
data(iris)
iris %>% group_by(Species) %>% # or dept
summarise(total_plength = sum(Petal.Length), # total owner sales
weird_divby72 = total_plength/72) # total/72?
# A tibble: 3 × 3
Species total_plength weird_divby72
<fctr> <dbl> <dbl>
1 setosa 73.1 1.015278
2 versicolor 213.0 2.958333
3 virginica 277.6 3.855556
Your case would probably look like this:
data %>% group_by(departmentName) %>%
summarise(total_sales = sum(ownerSales),
monthly_sales = total_sales/72)
I like dplyr for its syntax and pipeability. I think it is a huge improvement over base R for ease of data wrangling. Here is a good cheat sheet to help you get rolling: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
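For reference, the loop in the question fails for three separate reasons: for(dept in data$departmentName) visits every row's value rather than each department once, for(record in data) iterates over the columns of the data frame rather than its rows, and identical(data$departmentName, dept) compares the whole column with a single name, which is FALSE here, so total never changes. A minimal base R sketch of a corrected loop follows. The toy data frame sales and its values are made up for illustration, and mean() is used instead of dividing by 72 (which presumably was the number of months).

```r
# Toy data (hypothetical): 3 departments with 2 monthly records each
sales <- data.frame(
  departmentName = c("Bakery", "Bakery", "Deli", "Deli", "Produce", "Produce"),
  ownerSales     = c(100, 140, 80, 120, 200, 260),
  stringsAsFactors = FALSE
)

# Visit each department once, not every row
avgs <- c()
for (dept in unique(sales$departmentName)) {
  in_dept <- sales$departmentName == dept        # element-wise comparison, one TRUE/FALSE per row
  avgs[dept] <- mean(sales$ownerSales[in_dept])  # store the mean under the department's name
}
avgs

# The same result without an explicit loop
by_dept <- tapply(sales$ownerSales, sales$departmentName, mean)
by_dept
```

Both avgs and by_dept end up with one element per department, which is the shape the question was after.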
I'm new to plyr and dplyr and seriously don't get it. I have managed my way around some functions, but I struggle with really basic stuff such as the following example.
Taking mtcars, I have different overlapping subsets, such as vs = 1 and am = 1
I now want to run the same analysis, in this case median() for one var over the different subsets, and another analysis, such as mean() for another var.
This should give me in the end the same result, such as the following code - just much shorter:
data_mt <- mtcars # has binary dummy vars for grouping
data_vs <- data_mt[ which(data_mt$vs == 1 ), ]
data_am <- data_mt[ which(data_mt$am == 1 ), ]
median(data_mt$mpg)
median(data_vs$mpg)
median(data_am$mpg)
mean(data_mt$cyl)
mean(data_vs$cyl)
mean(data_am$cyl)
In my real example, I have an analog to data_mt, so if you have a solution starting there, without data_vs etc. that would be great.
I'm sure this is very basic, but I can't wrap my head around it - and as I have some 1500 variables that I want to look at, I'd appreciate your help =)
It may well be that my answer is already out there, but with the terminology I know I didn't find it explained for Dummies ;D
Edit:
To have a better understanding of what I am doing and what I am looking for, I hereby post my original code (not the mtcars example).
I have a dataset ds with 402 observations of 553 variables
The dataset comes from a study with human participants, some of which opted in for additional research mys or obs or both.
ds$mys <- 0
ds$mys[ which(ds$staffmystery_p == "Yes" ) ] <- 1
ds$obs <- 0
ds$obs[ which( !is.na(ds$sales_time)) ] <- 1
The 553 variables are either integers (e.g. for age or years of experience) or factors (e.g. sex or yes/no answers). I now want to compare some descriptives of the full dataset with the descriptives for the subsets and ideally also do a t-test for difference.
Currently I have just a very long list of code that reads more or less like the following (just much longer). This doesn't include t-tests.
describe(ds$age_b)
describe(dm$age_b)
describe(do$age_b)
prop.table(table(ds$sex_b))*100
prop.table(table(dm$sex_b))*100
prop.table(table(do$sex_b))*100
ds, dm and do are different datasets, but they are all just based on the above mentioned full dataset ds and the subsets ds$mys for dm and ds$obs for do
describe comes from the psych package and just lists descriptive statistics like mean or median etc. I don't need all of the metrics, mostly n, mean, median, sd and iqr.
The formula around 'prop.table' gives me a readout I can just copy into the excel tables I use for the final publications. I don't want automated output because I get asked all the time to add or change this, which is really just easier in excel than with automated output. (unless you know a much superior way ;)
Thank you so much!
Here is an option if we want to do this for different columns by group separately
library(dplyr)
library(purrr)
library(stringr)
map_dfc(c('vs', 'am'), ~
mtcars %>%
group_by(across(all_of(.x))) %>%
summarise(!! str_c("Mean_cyl_", .x) := mean(cyl),
!! str_c("Median_mpg_", .x) := median(mpg), .groups = 'drop'))%>%
mutate(Mean_cyl_full = mean(mtcars$cyl), Median_mpg_full = median(mtcars$mpg))
Output:
# A tibble: 2 x 8
# vs Mean_cyl_vs Median_mpg_vs am Mean_cyl_am Median_mpg_am Mean_cyl_full Median_mpg_full
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 0 7.44 15.6 0 6.95 17.3 6.19 19.2
#2 1 4.57 22.8 1 5.08 22.8 6.19 19.2
If the package version is old, we can replace the across with group_by_at
map_dfc(c('vs', 'am'), ~
mtcars %>%
group_by_at(vars(.x)) %>%
summarise(!! str_c("Mean_cyl_", .x) := mean(cyl),
!! str_c("Median_mpg_", .x) := median(mpg), .groups = 'drop'))%>%
mutate(Mean_cyl_full = mean(mtcars$cyl), Median_mpg_full = median(mtcars$mpg))
Update
Based on the update, we could place the datasets in a list, do the transformations at once and return a list of descriptive statistics and the proportion table
out <- map(dplyr::lst(dm, ds, do), ~ {
dat <- .x %>%
mutate(mys = as.integer(staffmystery_p == 'Yes'),
obs = as.integer(!is.na(sales_time)))
age_b_desc <- describe(dat$age_b)
prop_table_out <- prop.table(table(dat$sex_b))*100
return(dplyr::lst(age_b_desc, prop_table_out))
}
)
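The list-of-datasets idea above can also be tried directly on mtcars with base R only. This is just a sketch: the list name subsets and the row labels median_mpg and mean_cyl are chosen here for illustration.

```r
# Put the full data and the overlapping subsets into one named list
subsets <- list(
  full = mtcars,
  vs   = mtcars[mtcars$vs == 1, ],
  am   = mtcars[mtcars$am == 1, ]
)

# Apply the same two summaries to every element of the list;
# sapply() binds the results into a matrix with one column per subset
res <- sapply(subsets, function(d) {
  c(median_mpg = median(d$mpg), mean_cyl = mean(d$cyl))
})
res
```

Adding another subset or another statistic then means adding one list element or one entry in the returned vector, rather than another block of copy-pasted lines.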
I have a very big dataset that I'd like to illustrate using plotly in R.
A sample of my dataset is shown below:
> new_data_2
# Groups: newdatum [8]
date activity totaal
<date> <fct> <int>
1 2019-11-21 N11 144
2 2019-09-22 N11 129
3 2019-05-15 N22 117
4 2019-01-23 N22 12
5 2019-07-04 N22 12
6 2019-07-18 N22 12
...
For every activity I want to display the amount (totaal) per date (date) in a time series plot.
Somehow I don't get it right in R. Somehow I need to group my activity to display, but I can't figure it out.
new_data_2 %>%
group_by(activity) %>%
plot_ly(x=new_data_2$newdatum) %>%
add_lines(y=~new_data_2$totaal, color = ~factor(newdatum))
It displays an empty plot, without the 'activity' on the left side.
What I want to achieve is:
You're on the right track, but after the group_by() you need to tell R to do something to the groups.
new_data_2 %>%
group_by(activity, date) %>% # use two groupings since you want by activity & date
summarise(totaal_2 = sum(totaal))
That should get to the dataframe you're looking for. You can use ggplot & plotly on it from there.
I would recommend reshaping the data first (as above), saving it as a new object, and then graphing it. Doing it this way helps you see each step along the way. Pipes %>% are great, but can make each step difficult to see.
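To see what the reshaping step produces before plotting, here is a small base R sketch of the same grouping. The data frame toy and its rows are made up (loosely modelled on the sample in the question), and aggregate() is swapped in for group_by() plus summarise().

```r
# Made-up rows: note the two N11 records on the same date
toy <- data.frame(
  date     = as.Date(c("2019-11-21", "2019-11-21", "2019-09-22", "2019-05-15", "2019-01-23")),
  activity = c("N11", "N11", "N11", "N22", "N22"),
  totaal   = c(144L, 6L, 129L, 117L, 12L),
  stringsAsFactors = FALSE
)

# One row per activity/date combination, with totaal summed within each
per_day <- aggregate(totaal ~ activity + date, data = toy, FUN = sum)

# Sort so that each activity's series is in date order for plotting
per_day <- per_day[order(per_day$activity, per_day$date), ]
per_day
```

Saving the result as its own object (per_day here) makes it easy to inspect before handing it to a plotting function.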
This might not be very obvious at first, but the structure of your data is ideal for a plot with multiple time series. You don't even need to worry about the group_by function. Your dataset seems to have a long format, where the values in the date column and the names in the activity column are not unique, but there is only one value per activity and date.
Given the correct specifications, plot_ly() will group your data using color=~activity like this: p <- plot_ly(new_data_2, x = ~date, y = ~totaal, color = ~activity) %>% add_lines(). Since you haven't provided a data sample that is large enough, I'll use the built-in dataset economics_long to show you how you can do this. First of all, notice how the structure of my sampled dataset matches yours:
date variable value
1 1967-07-01 psavert 12.5
2 1967-08-01 psavert 12.5
3 1967-09-01 psavert 11.7
4 1967-10-01 psavert 12.5
5 1967-11-01 psavert 12.5
6 1967-12-01 psavert 12.1
...
Plot:
Code:
library(plotly)
library(dplyr)
# data
data("economics_long")
df <- data.frame(economics_long)
# keep only some variables that have values on a comparable level
df <- df %>% filter(!(variable %in% c('pop', 'pce', 'unemploy')))
# plotly time series
p <- plot_ly(df, x = ~date, y = ~value, color = ~variable) %>%
add_lines()
# show plot
p
I am calculating error rates for two different forecasting methods.
My basic approach is to group by nk, calculate the errors, and choose the method with the lower error rate. The issue is that MAP1E_arima_ds and MAPE_cagr_ds come out as the same value everywhere; somehow the group_by function is not taking effect during the calculation.
Here is something I tried
group_by(nk) %>%
mutate(MAP1E_arima_ds=sum(temp2$ABS_arima_error_ds)/nrow(temp2)) %>%
mutate(MAPE_cagr_ds=sum(temp2$ABS_cagr_error_ds)/nrow(temp2))
The expected result looks like this:
nk MAP1E_arima_ds MAPE_cagr_ds
1-G0175 value_x value_y
1-H0182 value_z value_a
so that I can compare error rate and choose forecasting method with less error rate.
If I understand you correctly, I think what you are looking for is this
library(dplyr)
df %>%
group_by(nk) %>%
summarise(MAP1E_arima_ds=sum(ABS_arima_error_ds)/n(),
MAPE_cagr_ds=sum(ABS_cagr_error_ds)/n())
# A tibble: 2 x 3
# nk MAP1E_arima_ds MAPE_cagr_ds
# <chr> <dbl> <dbl>
#1 1-G0175 14.7 3.38
#2 1-H0182 2.91 7.40
which is actually just the mean:
df %>%
group_by(nk) %>%
summarise(MAP1E_arima_ds = mean(ABS_arima_error_ds),
MAPE_cagr_ds = mean(ABS_cagr_error_ds))
Moreover, after copying your dput it seems that your data is already grouped by nk, so the following would also give the same result
df %>%
summarise(MAP1E_arima_ds=mean(ABS_arima_error_ds),
MAPE_cagr_ds=mean(ABS_cagr_error_ds))
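The original attempt returned one value everywhere because temp2$ABS_arima_error_ds and nrow(temp2) refer to the whole data frame, which bypasses the grouping; bare column names are what get evaluated per group. As a follow-up, the sketch below does the per-group mean in base R and then picks the method with the smaller error for each nk. The toy values and the best_method column are made up for illustration.

```r
# Hypothetical toy data; column names follow the question
df <- data.frame(
  nk                 = c("1-G0175", "1-G0175", "1-H0182", "1-H0182"),
  ABS_arima_error_ds = c(10, 20, 2, 4),
  ABS_cagr_error_ds  = c(3, 5, 6, 10),
  stringsAsFactors = FALSE
)

# Mean absolute error per nk for both methods
mape <- aggregate(cbind(ABS_arima_error_ds, ABS_cagr_error_ds) ~ nk,
                  data = df, FUN = mean)
names(mape)[2:3] <- c("MAP1E_arima_ds", "MAPE_cagr_ds")

# Pick the method with the smaller error in each group (ties go to arima)
mape$best_method <- ifelse(mape$MAP1E_arima_ds <= mape$MAPE_cagr_ds, "arima", "cagr")
mape
```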
I have a large data set with over 2000 observations. The data involves toxin concentrations in animal tissue. My response variable is myRESULT and I have multiple observations per ANALYTE of interest. I need to remove the outliers, as defined by numbers more than three SD away from the mean, from within each ANALYTE group.
While I realize that I should not remove outliers from a dataset normally, I would still like to know how to do it in R.
Here is a small portion of what my data look like:
It's subsetting by group, which can be done in different ways. With dplyr, you use group_by to set the grouping, then filter to subset rows, passing it an expression that returns TRUE for rows to keep and FALSE for outliers.
For example, using iris and 2 standard deviations (everything is within 3):
library(dplyr)
iris_clean <- iris %>%
group_by(Species) %>%
filter(abs(Petal.Length - mean(Petal.Length)) < 2*sd(Petal.Length))
iris_clean %>% count()
#> # A tibble: 3 x 2
#> # Groups: Species [3]
#> Species n
#> <fct> <int>
#> 1 setosa 46
#> 2 versicolor 47
#> 3 virginica 47
With a split-apply-combine approach in base R,
do.call(rbind, lapply(
split(iris, iris$Species),
function(x) x[abs(x$Petal.Length - mean(x$Petal.Length)) < 2*sd(x$Petal.Length), ]
))
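Applied to the column names from the question with the three-SD rule, a third base R variant uses ave() to compute each group's mean and SD aligned with the rows. This is only a sketch: the data frame tox and its values are invented so that exactly one observation is an outlier.

```r
# Invented data: analyte "A" has one extreme value, analyte "B" has none
tox <- data.frame(
  ANALYTE  = rep(c("A", "B"), each = 20),
  myRESULT = c(rep(10, 19), 100, seq(1, 20)),
  stringsAsFactors = FALSE
)

# ave() returns a vector as long as the data, holding each row's group statistic
grp_mean <- ave(tox$myRESULT, tox$ANALYTE, FUN = mean)
grp_sd   <- ave(tox$myRESULT, tox$ANALYTE, FUN = sd)

# Keep rows within three SDs of their own group's mean
tox_clean <- tox[abs(tox$myRESULT - grp_mean) < 3 * grp_sd, ]
table(tox_clean$ANALYTE)
```

Note that a group needs at least 11 observations before any single value can lie more than three sample SDs from the group mean, so the rule cannot flag anything in very small groups.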
I have a data.frame with a head that looks like this:
> head(movies_by_yr)
Source: local data frame [6 x 4]
Groups: YR_Released [6]
Movie_Title YR_Released Rating Num_Reviews
<fctr> <fctr> <dbl> <int>
1 The Shawshank Redemption 1994 9.2 1773755
2 The Godfather 1972 9.2 1211083
3 The Godfather: Part II 1974 9.0 832342
4 The Dark Knight 2008 8.9 1755341
5 12 Angry Men 1957 8.9 477276
6 Schindler's List 1993 8.9 909358
Note that when created, I specified stringsAsFactors=FALSE, so I believe the columns that got converted to factors were converted when I grouped the data frame in preparation for the next step:
movies_by_yr <- group_by(problem1_data, YR_Released)
Now we come to the problem. The goal is to group by YR_Released so we can get counts of records by year. I thought the next step would be something like this, but it throws an error and I am not sure what I am doing wrong:
summarise(movies_by_yr, total = nrow(YR_Released))
I choose nrow because once you have a grouping, the number of rows within that grouping should be the count. Can someone point me to what I am doing wrong?
The error thrown is:
Error in summarise_impl(.data, dots) : Not a vector
But I know this data.frame was created from a series of vectors and whatever is different from the sample code from class and my attempt, I am just not seeing it. Hoping someone can answer this ...
Let's use data that everyone has, like the built-in mtcars data.frame, to make this more useful for future readers.
If you look at the documentation ?nrow you'll see that function is meant to be called on a data.frame or matrix. You are calling it on a column, YR_Released. There is a vector-specific variant of the function nrow, called (confusingly) NROW - if you try that instead, it may work.
But even if it does, the intended dplyr way to count rows is with n(), like this:
library(dplyr)
mycars <- mtcars
mycars <- group_by(mycars, cyl)
summarise(mycars, total = n())
#> # A tibble: 3 x 2
#> cyl total
#> <dbl> <int>
#> 1 4 11
#> 2 6 7
#> 3 8 14
And because it's such a common use case, the wrapper function count() will save you some code:
mtcars %>%
count(cyl)
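The difference between nrow and NROW mentioned above can be checked on a plain vector with base R alone; the variable names here are just for illustration.

```r
# nrow() reads the dim attribute, which a plain vector does not have
n_lower <- nrow(mtcars$cyl)
is.null(n_lower)   # TRUE

# NROW() treats a vector as a one-column matrix
n_upper <- NROW(mtcars$cyl)
n_upper            # 32

# Base R counts per group, for comparison with n() and count()
cyl_counts <- table(mtcars$cyl)
cyl_counts
```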
Try this (I think it's what you want)
table(movies_by_yr$YR_Released)