Unpivot or transpose columns into rows in R [duplicate]

This question already has answers here: Reshaping data.frame from wide to long format (8 answers). Closed 2 years ago.
Given the dataframe below:
dt <- data.frame("Year" = 2020,
                 "Month" = c("Jan", "Jan", "Feb"),
                 "Location" = c("Store_1", "Store_1", "Store_2"),
                 "Apples" = c(100, 150, 120),
                 "Oranges" = c(50, 70, 50))
Year Month Location Apples Oranges
1 2020 Jan Store_1 100 50
2 2020 Jan Store_1 150 70
3 2020 Feb Store_2 120 50
How can I turn this table into the following one, keeping the first three columns and unpivoting the next two columns into rows?
Year Month Location Type Values
1 2020 Jan Store_1 Apple 100
2 2020 Jan Store_1 Apple 150
3 2020 Feb Store_2 Apple 120
4 2020 Jan Store_1 Orange 50
5 2020 Jan Store_1 Orange 70
6 2020 Feb Store_2 Orange 50
Any hints or tips on this?

We can use pivot_longer from tidyr
library(dplyr)
library(tidyr)
dt %>%
  pivot_longer(cols = Apples:Oranges, names_to = 'Type',
               values_to = 'Values') %>%
  arrange(Year, Type)
Output:
# A tibble: 6 x 5
# Year Month Location Type Values
# <dbl> <chr> <chr> <chr> <dbl>
#1 2020 Jan Store_1 Apples 100
#2 2020 Jan Store_1 Apples 150
#3 2020 Feb Store_2 Apples 120
#4 2020 Jan Store_1 Oranges 50
#5 2020 Jan Store_1 Oranges 70
#6 2020 Feb Store_2 Oranges 50
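For comparison, here is a hedged sketch of the same reshape with data.table's melt (not part of the answer above; it assumes the data.table package is installed):
library(data.table)
# melt keeps the id columns and stacks the remaining ones into Type/Values pairs
melt(as.data.table(dt),
     id.vars = c("Year", "Month", "Location"),
     variable.name = "Type", value.name = "Values")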

Related

Filling missing weeks across consecutive years with values

I want to insert the missing weeks for each household_id/channel combination so that the weeks form a continuous sequence. The corresponding duration column should be filled with 0, and the other columns' values remain the same.
Below is the dataset.
For example, for household_id 100 and channel A, the missing weeks are 37, 39 and 41. I want these weeks inserted with duration = 0.
But for household_id 101 and channel C, two years are involved, 2019 and 2020: the missing weeks are week 52 of 2019 and week 3 of 2020.
What I tried is below, using the complete function:
library(tidyr)
library(dplyr)
temp <- data %>%
  group_by(Household_id, channel) %>%
  tidyr::complete(week = seq(min(week), max(week)), fill = list(duration = 0))
For the Household_id 100 and channel A combination it worked fine: all weeks are now in sequence.
But for Household_id 101 and channel C it didn't work. After inserting week 52 of 2019, the sequence should continue with week 1 of 2020.
I also tried deriving a date from the week and year columns, thinking an exact date might make it work,
but I couldn't get that to work either.
data$date <- as.Date(paste(data$year,data$week,1,sep=""),"%Y%U%u")
Any help is greatly appreciated!
Here is the sample dataset with code:
library(dplyr)
library(tidyr)
data <- data.frame(Household_id = c(100,100,100,100,101,101,101,101,102,102),
                   channel = c("A","A","A","A","C","C","C","C","D","D"),
                   duration = c(12,34,567,67,98,23,56,89,73,76),
                   mala_fide_week = c(42,42,42,42,5,5,5,5,30,30),
                   mala_fide_year = c(2021,2021,2021,2021,2020,2020,2020,2020,2021,2021),
                   week = c(36,38,40,42,51,1,2,4,38,39),
                   year = c(2021,2021,2021,2021,2019,2020,2020,2020,2021,2021))
# impute missing weeks with duration = 0 for each household/channel combination
temp <- data %>%
  group_by(Household_id, channel) %>%
  tidyr::complete(week = seq(min(week), max(week)), fill = list(duration = 0))
# get a date from week/year, in case that helps
data$date <- as.Date(paste(data$year,data$week,1,sep=""),"%Y%U%u")
You can try defining the dates, making the sequence and converting to weeks. I used lubridate for ease.
library(dplyr)
library(tidyr)
library(lubridate)
data %>%
  group_by(Household_id, channel) %>%
  mutate(new = paste0(year, '01-01'),
         new = ymd(new) + 7 * week) %>%
  complete(new = seq(min(new), max(new), by = 'week'), fill = list(duration = 0)) %>%
  mutate(year = replace(year, is.na(year), format(new, '%Y')[is.na(year)]),
         week = week(new)) %>%
  select(-new)
Household_id channel duration mala_fide_week mala_fide_year week year
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 100 A 12 42 2021 37 2021
2 100 A 0 NA NA 38 2021
3 100 A 34 42 2021 39 2021
4 100 A 0 NA NA 40 2021
5 100 A 567 42 2021 41 2021
6 100 A 0 NA NA 42 2021
7 100 A 67 42 2021 43 2021
8 101 C 98 5 2020 52 2019
9 101 C 0 NA NA 53 2019
10 101 C 0 NA NA 1 2020
11 101 C 0 NA NA 2 2020
12 101 C 0 NA NA 3 2020
13 101 C 0 NA NA 4 2020
14 102 D 73 30 2021 39 2021
15 102 D 76 30 2021 40 2021
16 101 C 23 5 2020 2 2020
17 101 C 56 5 2020 3 2020
18 101 C 89 5 2020 5 2020
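A hedged alternative sketch of the same idea (not from the answer above), completing by actual dates instead of offsets from January 1st; it assumes the ISOweek package is installed and that week/year are ISO week numbers:
library(dplyr)
library(tidyr)
library(lubridate)
library(ISOweek)  # assumed available; used only to turn year/week into a date

data %>%
  # Monday of each row's ISO week, e.g. "2019-W51-1"
  mutate(wk_start = ISOweek2date(sprintf("%d-W%02d-1", year, week))) %>%
  group_by(Household_id, channel) %>%
  # fill in the missing Mondays per household/channel with duration = 0
  complete(wk_start = seq(min(wk_start), max(wk_start), by = "7 days"),
           fill = list(duration = 0)) %>%
  mutate(week = isoweek(wk_start),
         year = isoyear(wk_start)) %>%
  ungroup() %>%
  select(-wk_start)
Because the completion runs over real dates, week 52 of 2019 rolls over into week 1 of 2020 without any special handling.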

How to sort data in descending order based on every second value in R?

I am using dplyr for most of my data wrangling in R, yet I am having a hard time achieving this particular effect, and I can't seem to find the answer by googling either.
Assume I have data like the table below; what I want is to sort the person-grouped data by the cash value from the year 2021. Further down I show the outcome I wish to achieve. I am just missing the imagination on this one, I guess. If I only had the 2021 values I could simply use ... %>% arrange(desc(cash)), but I am not sure how to proceed from here.
year person cash
0 2020 personone 29
1 2021 personone 40
2 2020 persontwo 17
3 2021 persontwo 13
4 2020 personthree 62
5 2021 personthree 55
What I want is to sort this data in descending order based on the values from the year 2021, so that it looks like this:
year person cash
0 2020 personthree 62
1 2021 personthree 55
2 2020 personone 29
3 2021 personone 40
4 2020 persontwo 17
5 2021 persontwo 13
One approach using a join:
df %>%
  filter(year == 2021) %>%
  # group_by(person) %>% slice(2) %>% ungroup() %>% # each person's yr2
  arrange(-cash) %>%
  select(-cash, -year) %>%
  left_join(df)
Output:
person year cash
1 personthree 2020 62
2 personthree 2021 55
3 personone 2020 29
4 personone 2021 40
5 persontwo 2020 17
6 persontwo 2021 13
Another option:
library(dplyr)
dat %>%
  group_by(person) %>%
  mutate(maxcash = max(cash)) %>%
  arrange(desc(maxcash)) %>%
  ungroup()
# # A tibble: 6 x 4
# year person cash maxcash
# <int> <chr> <int> <int>
# 1 2020 personthree 62 62
# 2 2021 personthree 55 62
# 3 2020 personone 29 40
# 4 2021 personone 40 40
# 5 2020 persontwo 17 17
# 6 2021 persontwo 13 17
Or a one-liner, using base R as a helper:
dat %>%
arrange(-ave(cash, person, FUN = max))
# year person cash
# 4 2020 personthree 62
# 5 2021 personthree 55
# 0 2020 personone 29
# 1 2021 personone 40
# 2 2020 persontwo 17
# 3 2021 persontwo 13
Edit:
If instead of max you mean "always 2021's data", then:
dat %>%
  group_by(person) %>%
  mutate(cash2021 = cash[year == 2021]) %>%
  arrange(desc(cash2021)) %>%
  ungroup()
# # A tibble: 6 x 4
# year person cash cash2021
# <int> <chr> <int> <int>
# 1 2020 personthree 62 55
# 2 2021 personthree 55 55
# 3 2020 personone 29 40
# 4 2021 personone 40 40
# 5 2020 persontwo 17 13
# 6 2021 persontwo 13 13
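One more hedged variant (not from the answers above): build the person order from the 2021 rows once, turn person into a factor with those levels, and arrange by it. This sketch assumes each person has exactly one 2021 row:
library(dplyr)

# person names ordered by their 2021 cash, highest first
lv <- dat %>% filter(year == 2021) %>% arrange(desc(cash)) %>% pull(person)
dat %>%
  mutate(person = factor(person, levels = lv)) %>%
  arrange(person, year)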

Assigning Values in R by Date Range

I am trying to create a "week" variable in my dataset of daily observations that starts a new value (1, 2, 3, et cetera) whenever a new Monday arrives. My dataset has observations beginning on April 6th, 2020, and the dates are stored in "YYYY-MM-DD" as.Date() format. In this example, an observation between April 6th and April 12th would be a "1", an observation between April 13th and April 19th would be a "2", et cetera.
I am aware of the week() function in lubridate, but unfortunately that doesn't work for my purposes because a year is not an exact number of weeks, so the final week would only be a few days long. In other words, I would like the days of December 28th, 2020 to January 3rd, 2021 to be categorized as the same week.
Does anyone have a good solution to this problem? I appreciate any insight folks might have.
This will also work:
df <- data.frame(date = as.Date("2020-04-06") + 0:365)
library(dplyr)
library(lubridate)
df %>%
  group_by(d = year(date), week = isoweek(date)) %>%
  mutate(week = cur_group_id()) %>%
  ungroup() %>%
  select(-d)
# A tibble: 366 x 2
date week
<date> <int>
1 2020-04-06 1
2 2020-04-07 1
3 2020-04-08 1
4 2020-04-09 1
5 2020-04-10 1
6 2020-04-11 1
7 2020-04-12 1
8 2020-04-13 2
9 2020-04-14 2
10 2020-04-15 2
# ... with 356 more rows
Subtract the minimum date from each date, divide the difference by 7, and use floor to get one number for every 7 days.
x <- as.Date(c('2020-04-06','2020-04-07','2020-04-13','2020-12-28','2021-01-03'))
as.integer(floor((x - min(x))/7) + 1)
#[1] 1 1 2 39 39
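This works here because the series happens to start on a Monday. If it did not, a hedged tweak is to anchor the subtraction at the previous Monday first (using lubridate::floor_date) so the 7-day bins still run Monday to Sunday:
library(lubridate)

start_monday <- floor_date(min(x), unit = "week", week_start = 1)  # Monday on or before the first date
as.integer(floor((x - start_monday) / 7) + 1)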
Maybe lubridate::isoweek() and lubridate::isoyear() are what you want?
Some data:
df1 <- data.frame(date = seq.Date(as.Date("2020-04-06"),
                                  as.Date("2021-01-04"),
                                  by = "1 day"))
Example code:
library(dplyr)
library(lubridate)
df1 <- df1 %>%
  mutate(week = isoweek(date),
         year = isoyear(date)) %>%
  group_by(year) %>%
  mutate(week2 = 1 + (week - min(week))) %>%
  ungroup()
head(df1, 8)
# A tibble: 8 x 4
date week year week2
<date> <dbl> <dbl> <dbl>
1 2020-04-06 15 2020 1
2 2020-04-07 15 2020 1
3 2020-04-08 15 2020 1
4 2020-04-09 15 2020 1
5 2020-04-10 15 2020 1
6 2020-04-11 15 2020 1
7 2020-04-12 15 2020 1
8 2020-04-13 16 2020 2
tail(df1, 8)
# A tibble: 8 x 4
date week year week2
<date> <dbl> <dbl> <dbl>
1 2020-12-28 53 2020 39
2 2020-12-29 53 2020 39
3 2020-12-30 53 2020 39
4 2020-12-31 53 2020 39
5 2021-01-01 53 2020 39
6 2021-01-02 53 2020 39
7 2021-01-03 53 2020 39
8 2021-01-04 1 2021 1
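As a hedged aside, base R's cut() method for Date vectors can bin into Monday-starting weeks directly; the integer codes of the resulting factor give a sequential week index (assuming, as here, that the series begins in the first week of interest):
# bins start at the Monday on or before the earliest date in df1$date
df1$week3 <- as.integer(cut(df1$date, breaks = "week", start.on.monday = TRUE))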

Calculating cumulative sum for multiple columns in R

I'm an R newbie trying to calculate cumulative sums grouped by year, month, group and subgroup, with multiple value columns to accumulate.
Sample of the data:
df <- data.frame("Year" = 2020,
                 "Month" = c("Jan","Jan","Jan","Jan","Feb","Feb","Feb","Feb"),
                 "Group" = c("A","A","A","B","A","B","B","B"),
                 "SubGroup" = c("a","a","b","b","a","b","a","b"),
                 "V1" = c(10,10,20,20,50,50,10,10),
                 "V2" = c(0,1,2,2,0,5,1,1))
Year Month Group SubGroup V1 V2
1 2020 Jan A a 10 0
2 2020 Jan A a 10 1
3 2020 Jan A b 20 2
4 2020 Jan B b 20 2
5 2020 Feb A a 50 0
6 2020 Feb B b 50 5
7 2020 Feb B a 10 1
8 2020 Feb B b 10 1
Resulting Table wanted:
Year Month Group SubGroup V1 V2
1 2020 Jan A a 20 1
2 2020 Feb A a 70 1
3 2020 Jan A b 20 2
4 2020 Feb A b 20 2
5 2020 Jan B a 0 0
6 2020 Feb B a 10 1
7 2020 Jan B b 20 2
8 2020 Feb B b 80 8
From the sample table: in Jan 2020 the sum for Group 'A', SubGroup 'a' was 10 + 10 = 20; in Feb 2020 the value was 50, so the cumulative value is 20 + 50 = 70, and so on.
If there is no value for a combination, it should be treated as 0.
I've tried a few approaches but none got even close to the output I need. I would really appreciate some tips on this problem.
This is a simple group_by/mutate problem. The columns V1 and V2 are selected with across and cumsum is applied to them.
df$Month <- factor(df$Month, levels = c("Jan", "Feb"))
df %>%
  group_by(Year, Group, SubGroup) %>%
  mutate(across(V1:V2, ~ cumsum(.x))) %>%
  ungroup() %>%
  arrange(Year, Group, SubGroup, Month)
## A tibble: 8 x 6
# Year Month Group SubGroup V1 V2
# <chr> <fct> <chr> <chr> <dbl> <dbl>
#1 2020 Jan A a 10 0
#2 2020 Jan A a 20 1
#3 2020 Feb A a 70 1
#4 2020 Jan A b 20 2
#5 2020 Feb B a 10 1
#6 2020 Jan B b 20 2
#7 2020 Feb B b 70 7
#8 2020 Feb B b 80 8
If I understand what you are doing, you're taking the sum for each month, then doing the cumulative sums over the months. This is usually pretty easy in dplyr.
library(dplyr)
df %>%
  group_by(Year, Month, Group, SubGroup) %>%
  summarize(
    V1_sum = sum(V1),
    V2_sum = sum(V2)
  ) %>%
  group_by(Year, Group, SubGroup) %>%
  mutate(
    V1_cumsum = cumsum(V1_sum),
    V2_cumsum = cumsum(V2_sum)
  )
# A tibble: 6 x 8
# Groups: Year, Group, SubGroup [4]
# Year Month Group SubGroup V1_sum V2_sum V1_cumsum V2_cumsum
# <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 2020 Feb A a 50 0 50 0
# 2 2020 Feb B a 10 1 10 1
# 3 2020 Feb B b 60 6 60 6
# 4 2020 Jan A a 20 1 70 1
# 5 2020 Jan A b 20 2 20 2
# 6 2020 Jan B b 20 2 80 8
But you'll notice that the monthly cumulative sums run backwards (i.e. January comes after February), because by default the groups are ordered alphabetically. Also, you don't see the empty combinations because dplyr doesn't fill them in.
To fix the order of the months, you can either make your months numeric (convert to dates) or turn them into factors. You can add back 'missing' combinations of the grouping variables by using aggregate in base R instead of dplyr::summarize. aggregate includes all combinations of the grouping factors. aggregate converts the missing values to NA, but you can replace the NA with 0 with tidyr::replace_na, for example.
library(dplyr)
library(tidyr)
df <- data.frame("Year"=2020,
"Month"=c("Jan","Jan","Jan","Jan","Feb","Feb","Feb","Feb"),
"Group"=c("A","A","A","B","A","B","B","B"),
"SubGroup"=c("a","a","b","b","a","b","a","b"),
"V1"=c(10,10,20,20,50,50,10,10),
"V2"=c(0,1,2,2,0,5,1,1))
df$Month <- factor(df$Month, levels = c("Jan", "Feb"), ordered = TRUE)
# Get monthly sums
df1 <- with(df, aggregate(
list(V1_sum = V1, V2_sum = V2),
list(Year = Year, Month = Month, Group = Group, SubGroup = SubGroup),
FUN = sum, drop = FALSE
))
df1 <- df1 %>%
  # Replace NA with 0
  mutate(
    V1_sum = replace_na(V1_sum, 0),
    V2_sum = replace_na(V2_sum, 0)
  ) %>%
  # Get the cumulative sum across months
  group_by(Year, Group, SubGroup) %>%
  mutate(V1cumsum = cumsum(V1_sum),
         V2cumsum = cumsum(V2_sum)) %>%
  ungroup() %>%
  select(Year, Month, Group, SubGroup, V1 = V1cumsum, V2 = V2cumsum)
This gives the same result as your example:
# # A tibble: 8 x 6
# Year Month Group SubGroup V1 V2
# <dbl> <ord> <chr> <chr> <dbl> <dbl>
# 1 2020 Jan A a 20 1
# 2 2020 Feb A a 70 1
# 3 2020 Jan B a 0 0
# 4 2020 Feb B a 10 1
# 5 2020 Jan A b 20 2
# 6 2020 Feb A b 20 2
# 7 2020 Jan B b 20 2
# 8 2020 Feb B b 80 8
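A hedged tidyverse-only sketch of the same idea, using tidyr::complete() instead of aggregate() to fill the missing Month/Group/SubGroup combinations with 0 before the cumulative sum (it assumes Month is already the ordered factor created above):
library(dplyr)
library(tidyr)

df %>%
  group_by(Year, Month, Group, SubGroup) %>%
  summarize(across(V1:V2, sum), .groups = "drop") %>%
  # add the combinations that have no rows, with 0 instead of NA
  complete(Year, Month, Group, SubGroup, fill = list(V1 = 0, V2 = 0)) %>%
  arrange(Year, Group, SubGroup, Month) %>%
  group_by(Year, Group, SubGroup) %>%
  mutate(across(V1:V2, cumsum)) %>%
  ungroup()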
library(dplyr)
library(zoo)
df %>%
  arrange(as.yearmon(paste0(Year, '-', Month), '%Y-%b'), Group, SubGroup) %>%
  group_by(Year, Group, SubGroup) %>%
  mutate(
    V1 = cumsum(V1),
    V2 = cumsum(V2)
  ) %>%
  arrange(Year, Group, SubGroup, as.yearmon(paste0(Year, '-', Month), '%Y-%b')) # for the desired output ordering
# A tibble: 8 x 6
# Groups: Year, Group, SubGroup [4]
# Year Month Group SubGroup V1 V2
# <chr> <chr> <chr> <chr> <dbl> <dbl>
# 1 2020 Jan A a 10 0
# 2 2020 Jan A a 20 1
# 3 2020 Feb A a 70 1
# 4 2020 Jan A b 20 2
# 5 2020 Feb B a 10 1
# 6 2020 Jan B b 20 2
# 7 2020 Feb B b 70 7
# 8 2020 Feb B b 80 8

Use of Excel OFFSET in R for a range of values and multiple times

This is the file I want to append my data to:
Collection A
Jan
Feb
March
April
Collection B
Jan
Feb
March
April
Revenue A
Jan
Feb
March
April
Revenue B
Jan
Feb
March
April
The file I want to pull my data from looks like this:
Collection Month Collection A Collection B Revenue Month Revenue A Revenue B
Collection January 1 5 Revenue January 4 8
Collection February 2 6 Revenue February 3 7
Collection March 3 7 Revenue March 2 6
Collection April 4 8 Revenue April 1 5
I want the final output to look like this:
Collection A
Jan 1
Feb 2
March 3
April 4
Collection B
Jan 5
Feb 6
March 7
April 8
Revenue A
Jan 4
Feb 3
March 2
April 1
Revenue B
Jan 8
Feb 7
March 6
April 5
I am able to do this in Excel using the OFFSET and INDIRECT functions, but I want to automate it better for future purposes, so I am trying it in R.
I am really stuck on how to combine the two datasets to produce the desired output; it seems like an impossible task to me. I have played around with several functions like select, subset and arrange, but none of them have helped me progress.
I would be glad if someone could help me out with this.
Here's a way to achieve that output. Note that I removed the spaces from the column names in the sample data to make it easier to read into R. You didn't specify what you want the column names of the output dataframe to be, so as given they make little sense.
library(tidyverse)
tbl <- read_table2(
"Collection Month CollectionA CollectionB Revenue Month RevenueA RevenueB
Collection January 1 5 Revenue January 4 8
Collection February 2 6 Revenue February 3 7
Collection March 3 7 Revenue March 2 6
Collection April 4 8 Revenue April 1 5"
)
#> Warning: Duplicated column names deduplicated: 'Month' => 'Month_1' [6]
tbl %>%
  select(-Collection, -Revenue, -Month_1) %>%
  gather(variable, value, -Month) %>%
  group_by(variable) %>%
  group_modify(~ add_row(.x, Month = .y$variable, value = NA, .before = 1)) %>%
  ungroup() %>%
  select(-variable)
#> # A tibble: 20 x 2
#> Month value
#> <chr> <dbl>
#> 1 CollectionA NA
#> 2 January 1
#> 3 February 2
#> 4 March 3
#> 5 April 4
#> 6 CollectionB NA
#> 7 January 5
#> 8 February 6
#> 9 March 7
#> 10 April 8
#> 11 RevenueA NA
#> 12 January 4
#> 13 February 3
#> 14 March 2
#> 15 April 1
#> 16 RevenueB NA
#> 17 January 8
#> 18 February 7
#> 19 March 6
#> 20 April 5
Created on 2019-06-18 by the reprex package (v0.3.0)
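Since gather() is now superseded, here is a hedged sketch of the same steps with pivot_longer(), assuming the same tbl as above and the tidyverse already attached:
tbl %>%
  select(-Collection, -Revenue, -Month_1) %>%
  pivot_longer(-Month, names_to = "variable", values_to = "value") %>%
  group_by(variable) %>%
  # insert a header row carrying the variable name before each block
  group_modify(~ add_row(.x, Month = .y$variable, value = NA, .before = 1)) %>%
  ungroup() %>%
  select(-variable)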
