I am trying to calculate the quarterly growth rate in sales for different stores. After grouping my data several times, I ended up with the following table:
Store  Quarter  Weekly_Sales
1      Q2       60428109
1      Q3       20253948
2      Q2       74356864
2      Q3       24303355
3      Q2       15459190
3      Q3       5298005
4      Q2       79302989
4      Q3       27796792
5      Q2       12523263
5      Q3       4163791
How can I generate a growth rate table based on the equation (Q3-Q2)/Q2?
The code I have written so far is as follows. Thank you.
library("dplyr")
library("lubridate")
Walmart_data_set <- read.csv("Walmart_Store_sales.csv")
Walmart_data_set$Date <- as.Date(Walmart_data_set$Date, "%d-%m-%Y")
Walmart_data_set["Month"] <- month(Walmart_data_set$Date)
Walmart_data_set["Quarter"] <- quarters(Walmart_data_set$Date)
Walmart_data_set["Year"] <- format(Walmart_data_set$Date, format ="%Y")
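# Keep Q2 and Q3 of 2012 only. Note that `&` binds tighter than `|`, so
# Year == "2012" & Quarter == "Q3" | Quarter == "Q2" would also keep Q2 rows from every other year.
Q23_2012_Sales <- filter(Walmart_data_set, Year == "2012" & Quarter %in% c("Q2", "Q3"))
Sales_Store_quarter <- Q23_2012_Sales %>%
  group_by(Store, Quarter) %>%
  summarise(Weekly_Sales = sum(Weekly_Sales),
            .groups = 'drop')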
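A note on filter conditions that mix year and quarter tests like the one above: in R, `&` binds more tightly than `|`, so `A & B | C` is read as `(A & B) | C`. The sketch below uses made-up vectors (not the Walmart data) to show the difference, and two ways to write a "both quarters of 2012" condition.

```r
# Made-up example values, only to illustrate operator precedence.
year    <- c("2011", "2012", "2012", "2012")
quarter <- c("Q2",   "Q2",   "Q3",   "Q4")

# Read as (year == "2012" & quarter == "Q3") | quarter == "Q2":
# the 2011 Q2 element is kept as well.
loose <- year == "2012" & quarter == "Q3" | quarter == "Q2"

# Explicit parentheses restrict both quarters to 2012.
strict <- year == "2012" & (quarter == "Q3" | quarter == "Q2")

# %in% expresses the same condition without needing parentheses.
with_in <- year == "2012" & quarter %in% c("Q2", "Q3")

print(loose)
print(strict)
print(with_in)
```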
You can do it like this:
df %>%
  arrange(Store, Quarter) %>%
  group_by(Store) %>%
  mutate(growth = (Weekly_Sales - lag(Weekly_Sales)) / lag(Weekly_Sales))
Output:
Store Quarter Weekly_Sales growth
<dbl> <chr> <dbl> <dbl>
1 1 Q2 60428109 NA
2 1 Q3 20253948 -0.665
3 2 Q2 74356864 NA
4 2 Q3 24303355 -0.673
5 3 Q2 15459190 NA
6 3 Q3 5298005 -0.657
7 4 Q2 79302989 NA
8 4 Q3 27796792 -0.649
9 5 Q2 12523263 NA
10 5 Q3 4163791 -0.668
Don't group by Quarter.
library(dplyr)
dat %>%
  arrange(Store, Quarter) %>%
  group_by(Store) %>%
  mutate(Growth = c(NA, diff(Weekly_Sales)) / dplyr::lag(Weekly_Sales)) %>%
  ungroup()
# # A tibble: 10 x 4
# Store Quarter Weekly_Sales Growth
# <int> <chr> <int> <dbl>
# 1 1 Q2 60428109 NA
# 2 1 Q3 20253948 -0.665
# 3 2 Q2 74356864 NA
# 4 2 Q3 24303355 -0.673
# 5 3 Q2 15459190 NA
# 6 3 Q3 5298005 -0.657
# 7 4 Q2 79302989 NA
# 8 4 Q3 27796792 -0.649
# 9 5 Q2 12523263 NA
# 10 5 Q3 4163791 -0.668
This method assumes that every store has a Q2 row directly before its Q3 row. (That would not hold if your data had more history, with some stores perhaps missing a quarter or two.)
Data
dat <- structure(list(Store = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 5L, 5L), Quarter = c("Q2", "Q3", "Q2", "Q3", "Q2", "Q3", "Q2", "Q3", "Q2", "Q3"), Weekly_Sales = c(60428109L, 20253948L, 74356864L, 24303355L, 15459190L, 5298005L, 79302989L, 27796792L, 12523263L, 4163791L)), class = "data.frame", row.names = c(NA, -10L))
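The answer above points out that the lag-based calculation relies on each store's Q2 row sitting directly before its Q3 row. As a rough sketch of an alternative, the quarters can be paired explicitly by Store with a base R merge, so a store missing either quarter simply gets NA. The names q2, q3 and growth are illustrative, and the data is the sample from this answer.

```r
# Sample data taken from the answer above.
dat <- data.frame(
  Store = rep(1:5, each = 2),
  Quarter = rep(c("Q2", "Q3"), times = 5),
  Weekly_Sales = c(60428109, 20253948, 74356864, 24303355, 15459190,
                   5298005, 79302989, 27796792, 12523263, 4163791)
)

# Split by quarter, then join on Store so Q2 and Q3 are matched by key
# rather than by row position.
q2 <- dat[dat$Quarter == "Q2", c("Store", "Weekly_Sales")]
q3 <- dat[dat$Quarter == "Q3", c("Store", "Weekly_Sales")]
growth <- merge(q2, q3, by = "Store", all = TRUE, suffixes = c("_Q2", "_Q3"))

# (Q3 - Q2) / Q2; a store lacking one of the quarters ends up with NA.
growth$Growth <- (growth$Weekly_Sales_Q3 - growth$Weekly_Sales_Q2) /
  growth$Weekly_Sales_Q2
print(growth)
```

This gives one row per store instead of one row per store and quarter, which may or may not be the shape you want.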
Related
I have the following DataFrame in R:
Y ... Price Year Quantity Country
010190 ... 4781 2021 4 Germany
010190 ... 367 2021 3 Germany
010190 ... 4781 2021 6 France
010190 ... 250 2021 3 France
020190 ... 690 2021 NA USA
020190 ... 10 2021 6 USA
...... ... .... .. ...
217834 ... 56 2021 3 USA
217834 ... 567 2021 9 USA
As you can see, the numbers in the Y column start with 01..., 02..., 21.... I want to aggregate such rows from the 6-digit code to its first 2 digits, grouping by the categorical columns (e.g. Country and Year) and summing the numerical columns such as Quantity and Price. I also want rows with NAs to be taken into account in the calculation. In the end I want output like this:
Y Price Year Quantity Country
01 5148 2021 7 Germany
01 5031 2021 9 France
02 700 2021 6 USA
.. .... ... .... ...
21 623 2021 12 USA
You can use group_by and summarize from dplyr
library(dplyr)
df %>%
  mutate(Y = sprintf(as.numeric(factor(Y, unique(Y))), fmt = '%02d')) %>%
  group_by(Y, Year, Country) %>%
  summarize(across(where(is.numeric), sum))
#> # A tibble: 4 x 5
#> # Groups: Y, Year [3]
#> Y Year Country Price Quantity
#> <chr> <int> <chr> <int> <int>
#> 1 01 2021 France 5031 9
#> 2 01 2021 Germany 5148 7
#> 3 02 2021 USA 700 NA
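Update, as requested: take the first two characters of Y with substr() (the factor-based recoding above numbers the codes in order of appearance rather than by their leading digits), and pass na.rm = TRUE so that groups containing NA are still summed: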
library(dplyr)
df %>%
  mutate(Y = substr(Y, 1, 2)) %>%
  group_by(Y, Year, Country) %>%
  summarise(across(c(Price, Quantity), ~sum(., na.rm = TRUE)))
We could use substr to get the first two characters from Y and group_by and summarise() with sum()
library(dplyr)
df %>%
  mutate(Y = substr(Y, 1, 2)) %>%
  group_by(Y, Year, Country) %>%
  summarise(Price = sum(Price, na.rm = TRUE),
            Quantity = sum(Quantity, na.rm = TRUE)
  )
Y Year Country Price Quantity
<chr> <dbl> <chr> <dbl> <dbl>
1 01 2021 France 5031 9
2 01 2021 Germany 5148 7
3 02 2021 USA 700 6
4 21 2021 USA 623 12
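The difference between the first answer's NA for the USA 02 group and the 6 shown here comes down to na.rm. A minimal base R sketch of that behaviour follows; the helper sum_or_na is hypothetical (not part of dplyr or base R) and is only one possible way to treat groups where every value is missing.

```r
quantity <- c(NA, 6L)

# Without na.rm a single NA makes the whole sum NA.
plain_sum <- sum(quantity)

# With na.rm = TRUE the NA is dropped before summing.
clean_sum <- sum(quantity, na.rm = TRUE)

# Caveat: a group that is entirely NA sums to 0 with na.rm = TRUE.
all_missing <- sum(c(NA_integer_, NA_integer_), na.rm = TRUE)

# Hypothetical helper: keep NA when nothing was observed, otherwise sum.
sum_or_na <- function(x) if (all(is.na(x))) NA else sum(x, na.rm = TRUE)

print(plain_sum)
print(clean_sum)
print(all_missing)
print(sum_or_na(c(NA_integer_, NA_integer_)))
```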
Using aggregate and the first two digits of Y. Y is stored as an integer here, so it is zero-padded back to six digits before substr() is applied.
aggregate(cbind(Quantity, Price) ~ Y + Year + Country,
          transform(dat, Y = substr(sprintf("%06d", Y), 1, 2)), sum)
#    Y Year Country Quantity Price
# 1 01 2021  France        9  5031
# 2 01 2021 Germany        7  5148
# 3 02 2021     USA        7   700
# 4 21 2021     USA       12   623
Data:
dat <- structure(list(Y = c(10190L, 10190L, 10190L, 10190L, 20190L,
20190L, 217834L, 217834L), foo = c("...", "...", "...", "...",
"...", "...", "...", "..."), Price = c(4781L, 367L, 4781L, 250L,
690L, 10L, 56L, 567L), Year = c(2021L, 2021L, 2021L, 2021L, 2021L,
2021L, 2021L, 2021L), model = c(NA, NA, NA, NA, NA, NA, "Tesla",
"Tesla"), Quantity = c(4L, 3L, 6L, 3L, 1L, 6L, 3L, 9L), Country = c("Germany",
"Germany", "France", "France", "USA", "USA", "USA", "USA")), class = "data.frame", row.names = c(NA,
-8L))
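Because the dput above stores Y as an integer, the leading zero of codes such as 010190 is not part of the value, and taking the first two characters directly yields "10" rather than "01". A minimal sketch of two ways around this, assuming every code is meant to be six digits wide:

```r
# Codes as they appear in the dput above (integers, leading zero lost).
Y <- c(10190L, 20190L, 217834L)

# Taking the prefix directly works on "10190", so the result is off.
raw_prefix <- substr(Y, 1, 2)

# Option 1: zero-pad to six digits, then take the first two characters.
padded <- sprintf("%06d", Y)
prefix <- substr(padded, 1, 2)

# Option 2: integer-divide away the last four digits, then pad to two.
prefix_num <- sprintf("%02d", Y %/% 10000L)

print(raw_prefix)
print(prefix)
print(prefix_num)
```

If the codes can be read as character in the first place (for example with colClasses in read.csv), no padding is needed.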
I'm new to R and am having difficulty getting the counts I'm after. I have a dataset that contains several columns of various counts per year. Here is an example:
huc_code_eight  year  count_1  count_2
6010105         1946  4        4
6010105         1947  6        0
6010105         1948  2        0
6010105         1957  4        4
6020001         1957  2        0
8010203         1957  0        0
I want to aggregate these counts based upon consecutive years, grouped by huc_code_eight. The expected output would look like:
huc_code_eight  year         count_1  count_2
6010105         1946 - 1948  12       4
6010105         1957         4        4
6020001         1957         2        0
8010203         1957         0        0
I would like to avoid iterating through the data and summing these manually. Although I've found many examples of aggregating in R, I've been unable to adapt them to my use case.
Any help would be greatly appreciated!
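Here is a data.table approach.
Convert to a data.table, compute each year's gap from the previous year within huc_code_eight (set to 1 where there is no previous year), and create a run-length id that starts a new group whenever the gap is not 1. Using cumsum(yr != 1) keeps two equal-sized gaps in a row from being merged into one run:
library(data.table)
dat <- setDT(dat)[, yr := year - shift(year), by = huc_code_eight][is.na(yr), yr := 1][, grp := rleid(huc_code_eight, cumsum(yr != 1))]
Create the character year (a range if necessary) and the sum of the counts, by id: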
dat[, .(
  year = fifelse(.N > 1, paste0(min(year), "-", max(year)), paste0(year, collapse = "")),
  count_1 = sum(count_1), count_2 = sum(count_2)),
  by = .(grp, huc_code_eight)][, grp := NULL][]
Output:
huc_code_eight year count_1 count_2
1: 6010105 1946-1948 12 4
2: 6010105 1957 4 4
3: 6020001 1957 2 0
4: 8010203 1957 0 0
We can create a grouping column based on difference of adjacent elements in 'year' along with 'huc_code_eight' and then summarise
library(dplyr)
library(stringr)
df1 %>%
  group_by(huc_code_eight) %>%
  mutate(year_grp = cumsum(c(TRUE, diff(year) != 1))) %>%
  group_by(year_grp, .add = TRUE) %>%
  summarise(year = if (n() > 1)
              str_c(range(year), collapse = ' - ') else as.character(year),
            across(starts_with('count'), ~ sum(.x, na.rm = TRUE)), .groups = 'drop') %>%
  dplyr::select(-year_grp)
Output:
# A tibble: 4 × 4
huc_code_eight year count_1 count_2
<int> <chr> <int> <int>
1 6010105 1946 - 1948 12 4
2 6010105 1957 4 4
3 6020001 1957 2 0
4 8010203 1957 0 0
data
df1 <- structure(list(huc_code_eight = c(6010105L, 6010105L, 6010105L,
6010105L, 6020001L, 8010203L), year = c(1946L, 1947L, 1948L,
1957L, 1957L, 1957L), count_1 = c(4L, 6L, 2L, 4L, 2L, 0L), count_2 = c(4L,
0L, 0L, 4L, 0L, 0L)), class = "data.frame", row.names = c(NA,
-6L))
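Both answers hinge on turning "is this year exactly one after the previous one?" into a group id. A small base R sketch of that step on plain vectors, with illustrative names (run_id, label_run) that do not come from either answer:

```r
# Years for a single huc_code_eight, taken from the question's example.
year <- c(1946L, 1947L, 1948L, 1957L)

# diff(year) != 1 flags every break in the sequence; the leading TRUE
# opens the first run, and cumsum turns the flags into a run counter.
run_id <- cumsum(c(TRUE, diff(year) != 1))

# Label each run as a single year or as "first - last".
label_run <- function(y) {
  if (length(y) > 1) paste(range(y), collapse = " - ") else as.character(y)
}
labels <- tapply(year, run_id, label_run)

# Two equal-sized gaps in a row still produce three separate runs.
gappy_id <- cumsum(c(TRUE, diff(c(1946L, 1950L, 1954L)) != 1))

print(run_id)
print(labels)
print(gappy_id)
```

In a grouped pipeline the same expression is evaluated once per huc_code_eight, so runs never continue across codes.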
I'm sorry if this question has already been answered, but I don't really know how to phrase my question.
I have a data frame structured in this way:
country  year  score
France   2020  10
France   2019  9
Germany  2020  15
Germany  2019  14
I would like to add a new column called previous_year_score that looks up the "score" of the same country for "year - 1". In this case France 2020 would have a previous_year_score of 9, while France 2019 would have NA.
You can use match() for this. I imagine there are plenty of other solutions too.
Data:
df <- structure(list(country = c("France", "France", "Germany", "Germany"
), year = c(2020L, 2019L, 2020L, 2019L), score = c(10L, 9L, 15L,
14L), prev_score = c(9L, NA, 14L, NA)), row.names = c(NA, -4L
), class = "data.frame")
Solution:
i <- match(paste(df[[1]],df[[2]]-1),paste(df[[1]],df[[2]]))
df$prev_score <- df[i,3]
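To make the match() step easier to follow, here is the same idea spelled out with named columns instead of positions. It is only a sketch on the question's four rows; key_wanted and key_have are illustrative names.

```r
df <- data.frame(
  country = c("France", "France", "Germany", "Germany"),
  year    = c(2020L, 2019L, 2020L, 2019L),
  score   = c(10L, 9L, 15L, 14L)
)

# Key each row is looking for: same country, previous year.
key_wanted <- paste(df$country, df$year - 1)
# Key each row actually has.
key_have <- paste(df$country, df$year)

# match() returns, for every wanted key, the row position where it
# occurs, or NA when no such row exists.
i <- match(key_wanted, key_have)

# Indexing with NA yields NA, so rows without a previous year get NA.
df$previous_year_score <- df$score[i]
print(i)
print(df)
```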
You can use the following solution:
library(dplyr)
df %>%
  group_by(country) %>%
  arrange(year) %>%
  mutate(prev_val = ifelse(year - lag(year) == 1, lag(score), NA))
# A tibble: 4 x 4
# Groups: country [2]
country year score prev_val
<chr> <int> <int> <int>
1 France 2019 9 NA
2 Germany 2019 14 NA
3 France 2020 10 9
4 Germany 2020 15 14
Using case_when
library(dplyr)
df1 %>%
  arrange(country, year) %>%
  group_by(country) %>%
  mutate(prev_val = case_when(year - lag(year) == 1 ~ lag(score)))
# A tibble: 4 x 4
# Groups: country [2]
country year score prev_val
<chr> <int> <int> <int>
1 France 2019 9 NA
2 France 2020 10 9
3 Germany 2019 14 NA
4 Germany 2020 15 14
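The year - lag(year) == 1 test in both answers is what stops a gap in the years from being treated as "the previous year". The same requirement can be sketched as a self-join in base R. Note the sample data below is deliberately altered from the question (France has 2018 instead of 2019) to show the gap case, and prev and out are illustrative names.

```r
# Altered sample: France has no 2019 row, so France 2020 has no previous year.
df <- data.frame(
  country = c("France", "France", "Germany", "Germany"),
  year    = c(2020L, 2018L, 2020L, 2019L),
  score   = c(10L, 9L, 15L, 14L)
)

# Shift every row forward by one year: a score recorded for year y is
# the "previous year score" of year y + 1.
prev <- data.frame(
  country = df$country,
  year = df$year + 1L,
  previous_year_score = df$score
)

# Left join on country and year; rows with no match keep NA.
out <- merge(df, prev, by = c("country", "year"), all.x = TRUE)
print(out)
```

Only Germany 2020 finds a match here; France 2020 stays NA because its nearest earlier row is two years back.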
Given a data frame df as follows:
df <- structure(list(year = c(2001, 2002, 2003, 2004), `1` = c(22.0775,
24.2460714285714, 29.4039285714286, 27.7110714285714), `2` = c(27.2535714285714,
35.9996428571429, 26.39, 27.8557142857143), `3` = c(24.7710714285714,
25.4428571428571, 15.1142857142857, 19.9657142857143)), row.names = c(NA,
-4L), groups = structure(list(year = c(2001, 2002, 2003, 2004
), .rows = structure(list(1L, 2L, 3L, 4L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, 4L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
Out:
year 1 2 3
0 2001 22.07750 27.25357 24.77107
1 2002 24.24607 35.99964 25.44286
2 2003 29.40393 26.39000 15.11429
3 2004 27.71107 27.85571 19.96571
For columns 1, 2 and 3, how can I calculate the year-to-year absolute change?
The expected result looks like this:
year 1 2 3
0 2002 2.16857 8.74607 0.67179
1 2003 5.15786 9.60964 10.32857
2 2004 1.69286 1.46571 4.85142
The final objective is to compare the values of columns 1, 2 and 3 across all years and find the year and column with the largest change; in this example, that should be 2003 and column 3.
How could I do that in R? Thanks.
You can use :
library(dplyr)
data <- df %>% ungroup %>% summarise(across(-1, ~abs(diff(.))))
data
# A tibble: 3 x 3
# `1` `2` `3`
# <dbl> <dbl> <dbl>
#1 2.17 8.75 0.672
#2 5.16 9.61 10.3
#3 1.69 1.47 4.85
To get max change
mat <- which(data == max(data), arr.ind = TRUE)
mat
# row col
#[1,] 2 3
#Year name
df$year[mat[, 1] + 1]
#[1] 2003
#Column name
mat[, 2]
#col
# 3
You can try:
library(reshape2)
library(dplyr)
#Melt
Melted <- reshape2::melt(df,id.vars = 'year')
#Group
Melted %>%
  group_by(variable) %>%
  mutate(Diff = c(0, abs(diff(value)))) %>%
  ungroup() %>%
  filter(Diff == max(Diff))
# A tibble: 1 x 4
year variable value Diff
<dbl> <fct> <dbl> <dbl>
1 2003 3 15.1 10.3
We can apply the diff on the entire dataset by converting the numeric columns of interest to matrix in base R
cbind(year = df$year[-1], abs(diff(as.matrix(df[-1]))))
# year 1 2 3
#[1,] 2002 2.168571 8.746071 0.6717857
#[2,] 2003 5.157857 9.609643 10.3285714
#[3,] 2004 1.692857 1.465714 4.8514286
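Building on the matrix of absolute differences, the year and column of the largest change can be read off with which(..., arr.ind = TRUE). This is a base R sketch on a plain data frame rebuilt from the question's values; changes, pos and largest are illustrative names.

```r
# Plain data frame with the question's values (check.names = FALSE keeps
# the column names "1", "2", "3" as they are).
df <- data.frame(
  year = 2001:2004,
  `1` = c(22.0775, 24.2460714285714, 29.4039285714286, 27.7110714285714),
  `2` = c(27.2535714285714, 35.9996428571429, 26.39, 27.8557142857143),
  `3` = c(24.7710714285714, 25.4428571428571, 15.1142857142857, 19.9657142857143),
  check.names = FALSE
)

# Absolute year-to-year change per column; row k describes the change
# into year k + 1, so label the rows with the later year.
changes <- abs(diff(as.matrix(df[-1])))
rownames(changes) <- df$year[-1]

# Row/column position of the maximum, translated back to names.
pos <- which(changes == max(changes), arr.ind = TRUE)
largest <- data.frame(
  year = as.numeric(rownames(changes)[pos[, "row"]]),
  column = colnames(changes)[pos[, "col"]],
  change = changes[pos]
)
print(largest)
```

Looking the column up by name avoids relying on the column index happening to equal the column label.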
I have a data frame in which the values are stored as characters. However, many values contain two numbers that need to be added together. Example:
2014 Q1 Sales 2014 Q2 Sales 2014 Q3 Sales 2014 Q4 Sales
Product 1 3+6 2+10 8 13+2
Product 2 6 4+0 <NA> 5
Product 3 <NA> 5+9 3+1 11
Is there a way to go through the whole data frame and replace all cells containing characters like "3+6" with new values equal to their sum? I assume this would involve coercing the characters to numeric or integers, but I don't know how that would be possible for values with the + sign in them. I would like the example data frame to end up looking like this:
2014 Q1 Sales 2014 Q2 Sales 2014 Q3 Sales 2014 Q4 Sales
Product 1 9 12 8 15
Product 2 6 4 <NA> 5
Product 3 <NA> 14 4 11
Here's an easier example:
dat <- data.frame(a=c("3+6", "10"), b=c("12", NA), c=c("3+4", "5+6"))
dat
## a b c
## 1 3+6 12 3+4
## 2 10 <NA> 5+6
apply(dat, 1:2, function(x) eval(parse(text=x)))
## a b c
## [1,] 9 12 7
## [2,] 10 NA 11
Using R itself to do the computation with eval and parse does the trick.
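Since eval(parse()) will run whatever expression a cell contains, it can be worth avoiding on data you do not fully control. A sketch of an alternative that only splits on "+" and adds the pieces, in base R; sum_terms is a hypothetical helper name, and the data is the small example from this answer.

```r
# Hypothetical helper: split each cell on a literal "+" and add the parts.
# NA cells stay NA; cells without "+" are simply converted to numeric.
sum_terms <- function(x) {
  parts <- strsplit(as.character(x), "+", fixed = TRUE)
  vapply(parts, function(p) sum(as.numeric(p)), numeric(1))
}

dat <- data.frame(
  a = c("3+6", "10"),
  b = c("12", NA),
  c = c("3+4", "5+6"),
  stringsAsFactors = FALSE
)

# dat[] <- keeps the data frame structure while replacing every column.
dat[] <- lapply(dat, sum_terms)
print(dat)
```

Unlike the two-number regular expression approach, this also handles cells with more than two terms, such as "1+2+3".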
Here is one option with gsubfn that avoids eval(parse()). We convert the 'data.frame' to a 'matrix' (as.matrix(dat)), match a run of digits ([0-9]+) captured as a group with parentheses ((..)), followed by +, followed by a second run of digits, and replace the match with the sum of the two captured values after converting them to numeric. Assigning the result back to dat[] keeps the same structure as the original 'dat'.
library(gsubfn)
dat[] <- as.numeric(gsubfn('([0-9]+)\\+([0-9]+)',
~as.numeric(x)+as.numeric(y), as.matrix(dat)))
dat
# 2014 Q1 Sales 2014 Q2 Sales 2014 Q3 Sales 2014 Q4 Sales
#Product 1 9 12 8 15
#Product 2 6 4 NA 5
#Product 3 NA 14 4 11
Or we can loop the columns with lapply and perform the replacement with gsubfn for each of the columns.
dat[] <- lapply(dat, function(x) as.numeric(gsubfn('([0-9]+)\\+([0-9]+)',
~as.numeric(x)+as.numeric(y), as.character(x))))
data
dat <- structure(list(`2014 Q1 Sales` = structure(c(1L, 2L, NA), .Label = c("3+6",
"6"), class = "factor"), `2014 Q2 Sales` = structure(1:3, .Label = c("2+10",
"4+0", "5+9"), class = "factor"), `2014 Q3 Sales` = structure(c(2L,
NA, 1L), .Label = c("3+1", "8"), class = "factor"), `2014 Q4 Sales` = structure(c(2L,
3L, 1L), .Label = c("11", "13+2", "5"), class = "factor")), .Names = c("2014 Q1 Sales",
"2014 Q2 Sales", "2014 Q3 Sales", "2014 Q4 Sales"), class = "data.frame", row.names = c("Product 1",
"Product 2", "Product 3"))