I have the following table:
Date        Cat
15/2/1999   A
15/2/1999   A
15/2/1999   B
15/5/1999   A
15/5/1999   B
15/10/1999  C
15/10/1999  C
15/2/2001   A
15/2/2001   A
15/6/2001   B
15/6/2001   B
15/6/2001   C
15/11/2001  C
15/11/2001  C
I would like to apply pivot_wider (or a similar function) to it while also breaking the data down by Month and Year, as shown below. The Cat column is spread into one column per value (A, B and C), each holding the count, plus a Total column.
Month     Year  A  B  C  Total
February  1999  2  1  0  3
May       1999  1  1  0  2
October   1999  0  0  2  2
February  2001  2  0  0  2
June      2001  0  2  1  3
November  2001  0  0  2  2
Does anyone here know how I can do both together? Thanks.
You can do this with tidyverse packages. First, format your date column as date, then count by month, pivot to wider and format the table.
library(tidyverse)
data %>%
mutate(Date = as.Date(Date, format = "%d/%m/%Y")) %>%
group_by(Cat, month = lubridate::floor_date(Date, "month")) %>%
count(Cat) %>%
pivot_wider(names_from = Cat, values_from = n, values_fill = 0) %>%
mutate(year = lubridate::year(month), .before = "A",
month = lubridate::month(month, label = TRUE, abbr = FALSE)) %>%
mutate(Total = rowSums(across(A:C))) %>%
arrange(year)
month year A B C Total
<ord> <dbl> <int> <int> <int> <dbl>
1 February 1999 2 1 0 3
2 May 1999 1 1 0 2
3 October 1999 0 0 2 2
4 February 2001 2 0 0 2
5 June 2001 0 2 1 3
6 November 2001 0 0 2 2
data
data <- structure(list(Date = c("15/2/1999", "15/2/1999", "15/2/1999",
"15/5/1999", "15/5/1999", "15/10/1999", "15/10/1999", "15/2/2001",
"15/2/2001", "15/6/2001", "15/6/2001", "15/6/2001", "15/11/2001",
"15/11/2001"), Cat = c("A", "A", "B", "A", "B", "C", "C", "A",
"A", "B", "B", "C", "C", "C")), class = "data.frame", row.names = c(NA,
-14L))
Another possible solution:
library(tidyverse)
library(lubridate)
df <- data.frame(
stringsAsFactors = FALSE,
Date = c("15/2/1999",
"15/2/1999","15/2/1999","15/5/1999","15/5/1999",
"15/10/1999","15/10/1999","15/2/2001","15/2/2001",
"15/6/2001","15/6/2001","15/6/2001","15/11/2001",
"15/11/2001"),
Cat = c("A","A","B","A",
"B","C","C","A","A","B","B","C","C","C")
)
df %>%
mutate(Month = month(dmy(Date), label = TRUE), Year = year(dmy(Date))) %>%
pivot_wider(id_cols = c(Month, Year), names_from = Cat,
values_from = Cat, values_fn = length, values_fill = 0) %>%
mutate(Total = rowSums(.[3:5]))
#> # A tibble: 6 × 6
#> Month Year A B C Total
#> <ord> <dbl> <int> <int> <int> <dbl>
#> 1 Feb 1999 2 1 0 3
#> 2 May 1999 1 1 0 2
#> 3 Oct 1999 0 0 2 2
#> 4 Feb 2001 2 0 0 2
#> 5 Jun 2001 0 2 1 3
#> 6 Nov 2001 0 0 2 2
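For comparison, the same month-by-category counts can be sketched in base R with table(). This is only an illustration using the question's data; the names `sales`, `ym`, `counts` and `wide` are made up here and are not part of either answer above.

```r
# Base R sketch; `sales` is a hypothetical stand-in for the question's table
sales <- data.frame(
  Date = c("15/2/1999", "15/2/1999", "15/2/1999", "15/5/1999", "15/5/1999",
           "15/10/1999", "15/10/1999", "15/2/2001", "15/2/2001", "15/6/2001",
           "15/6/2001", "15/6/2001", "15/11/2001", "15/11/2001"),
  Cat = c("A", "A", "B", "A", "B", "C", "C", "A", "A", "B", "B", "C", "C", "C")
)

# Parse day/month/year strings, then build "YYYY-MM" keys (they sort chronologically)
d  <- as.Date(sales$Date, format = "%d/%m/%Y")
ym <- format(d, "%Y-%m")

# Cross-tabulate month keys against categories; absent combinations become 0
counts <- table(ym, sales$Cat)
wide <- as.data.frame.matrix(counts)   # one column per category: A, B, C
wide$Total <- rowSums(wide)

# Recover Month and Year from the row keys; month.name avoids locale issues
wide <- cbind(
  Month = month.name[as.integer(substr(rownames(wide), 6, 7))],
  Year  = as.integer(substr(rownames(wide), 1, 4)),
  wide
)
rownames(wide) <- NULL
wide
```

The "YYYY-MM" key is a design choice: it keeps the rows in calendar order without a separate sort step.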
Related
I have a dataframe that looks like
id = c("1", "2", "3")
IN1999 = c(1, 1, 0)
IN2000 = c(1, 0, 1)
TEST1999 = c(10, 12, NA)
TEST2000 = c(15, NA, 11)
df <- data.frame(id, IN1999, IN2000, TEST1999, TEST2000)
I am trying to use pivot_longer to change it into this form:
id year IN TEST
1 1 1999 1 10
2 1 2000 1 15
3 2 1999 1 12
4 2 2000 0 NA
5 3 1999 0 NA
6 3 2000 1 11
My current code looks like this
df %>%
pivot_longer(col = !id, names_to = c(".value", "year"),
names_sep = 4)
but obviously, with names_sep = 4, R cuts IN1999 and IN2000 in the wrong place. How can I set the argument so that R splits each column name just before the last four digits?
The names_sep argument of pivot_longer also accepts a regular expression, which lets you split just before a run of four digits, as in the example below:
library(tidyr)
df |>
pivot_longer(cols = !id, names_to = c(".value", "year"),
names_sep = "(?=\\d{4})")
Output:
# A tibble: 6 × 4
id year IN TEST
<chr> <chr> <dbl> <dbl>
1 1 1999 1 10
2 1 2000 1 15
3 2 1999 1 12
4 2 2000 0 NA
5 3 1999 0 NA
6 3 2000 1 11
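To see where that lookahead places the split, here is a small base R sketch that applies the same pattern to the column names directly. It only illustrates the regex and is not how tidyr works internally; `cols`, `pos`, `prefix` and `year` are names chosen for this example.

```r
cols <- c("IN1999", "IN2000", "TEST1999", "TEST2000")

# (?=\d{4}) is a zero-width lookahead: it matches the position right before
# the first run of four digits without consuming any characters
pos <- as.integer(regexpr("(?=\\d{4})", cols, perl = TRUE))

prefix <- substr(cols, 1, pos - 1)       # part that becomes the ".value" name
year   <- substr(cols, pos, nchar(cols)) # part that becomes the "year" value

data.frame(column = cols, value_name = prefix, year = year)
```

Because the match is zero-width, nothing is dropped from the name, which is why the year keeps all four digits.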
I'm fairly new to R and have run into the following situation: I want to create a summary row for each group in the data frame, grouped by Year and Model, where the value of each summary row is obtained by subtracting the values of some Variables from the value of another Variable in the same group.
df <- data.frame(Model = c(1,1,1,2,2,2,2,2,2,2,2,2,2),
Year = c(2020, 2020, 2020, 2020, 2020, 2020, 2020, 2030, 2030, 2030, 2040, 2040, 2040),
Variable = c("A", "B", "C", "A", "B", "C", "D", "A", "C", "E", "A", "C", "D"),
value = c(15, 2, 5, 25, 6, 4, 4, 41, 24,1, 15, 3, 2))
I have managed to create a new row for each group, so it already has a Year and a Variable name that I manually specified using:
df <- df %>% group_by(Model, Year) %>% group_modify(~ add_row(., Variable = "New", .before=0))
However, I am struggling to write the expression that calculates the value.
Instead of the NAs, I want the value of A - B - D in each group.
I would appreciate any help. This is my first thread here, so apologies for any inconvenience.
You could pivot wide and then back; this would add rows with zeros where missing:
library(dplyr); library(tidyr)
df %>%
pivot_wider(names_from = Variable, values_from = value, values_fill = 0) %>%
mutate(new = A - B - D) %>%
pivot_longer(-c(Model, Year), names_to = "Variable")
# A tibble: 24 × 4
Model Year Variable value
<dbl> <dbl> <chr> <dbl>
1 1 2020 A 15
2 1 2020 B 2
3 1 2020 C 5
4 1 2020 D 0
5 1 2020 E 0
6 1 2020 new 13 # 15 - 2 - 0 = 13
7 2 2020 A 25
8 2 2020 B 6
9 2 2020 C 4
10 2 2020 D 4
# … with 14 more rows
EDIT - variation where we leave the missing values and use coalesce(x, 0) to allow subtraction to treat NA's as zeroes. The pivot_wider creates NA's in the missing spots, but we can exclude these in the pivot_longer using values_drop_na = TRUE.
df %>%
pivot_wider(names_from = Variable, values_from = value) %>%
mutate(new = A - coalesce(B,0) - coalesce(D,0)) %>%
pivot_longer(-c(Model, Year), names_to = "Variable", values_drop_na = TRUE)
# A tibble: 17 × 4
Model Year Variable value
<dbl> <dbl> <chr> <dbl>
1 1 2020 A 15
2 1 2020 B 2
3 1 2020 C 5
4 1 2020 new 13
5 2 2020 A 25
6 2 2020 B 6
7 2 2020 C 4
8 2 2020 D 4
9 2 2020 new 15
10 2 2030 A 41
11 2 2030 C 24
12 2 2030 E 1
13 2 2030 new 41
14 2 2040 A 15
15 2 2040 C 3
16 2 2040 D 2
17 2 2040 new 13
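The reason the missing values need handling at all is that NA propagates through arithmetic. A minimal base R sketch, using the A, B and D values of the four groups above; `na_to_zero` is a hypothetical helper that stands in for coalesce(x, 0):

```r
# Values of A, B and D for the four Model/Year groups; NA marks a missing Variable
A <- c(15, 25, 41, 15)
B <- c(2, 6, NA, NA)
D <- c(NA, 4, NA, 2)

# Hypothetical helper: replace NA with 0, leave other values untouched
na_to_zero <- function(x) ifelse(is.na(x), 0, x)

plain  <- A - B - D                          # NA wherever B or D is missing
filled <- A - na_to_zero(B) - na_to_zero(D)  # missing terms count as zero

plain
filled
```

Only the second group has all three Variables, so it is the only one where the plain subtraction gives a number.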
I have a data frame with sales per customer and month.
df <-
data.frame(
stringsAsFactors = FALSE,
date = c("jan","jan","jan","jan",
"jan","jan","jan","feb","feb","feb","feb","feb",
"feb","feb"),
customer = c("john","john","john","Mary",
"Mary","Mary","Mary","Robert","Robert","Mary",
"john","john","Robert","Robert"),
product = c("a","b","d","a","b","c",
"d","a","b","c","a","c","c","d"))
date customer product
1 jan john a
2 jan john b
3 jan john d
4 jan Mary a
5 jan Mary b
6 jan Mary c
7 jan Mary d
8 feb Robert a
9 feb Robert b
10 feb Mary c
11 feb john a
12 feb john c
13 feb Robert c
14 feb Robert d
I need to summarize how many times the same customer is present across months and products.
Expected result:
date       a  b  c  d  same cust
jan        2  2  1  2  0
feb        2  1  2  0  1
same cust  1  0  1  0
A possible solution:
library(tidyverse)
df <-
data.frame(
stringsAsFactors = FALSE,
date = c("jan","jan","jan","jan",
"jan","jan","jan","feb","feb","feb","feb","feb",
"feb","feb"),
customer = c("john","john","john","Mary",
"Mary","Mary","Mary","Robert","Robert","Mary",
"john","john","Robert","Robert"),
product = c("a","b","d","a","b","c",
"d","a","b","c","a","c","c","d"))
df %>%
pivot_wider(id_cols = date, names_from = product, values_from = customer, values_fn = length) %>%
bind_cols(SCust = table(df$customer, df$date) %>% apply(2, \(x) sum(x>=2))) %>%
bind_rows(c(tibble(date="SCust"),
table(df$customer, df$product) %>% apply(2, \(x) sum(x>=2))))
#> # A tibble: 3 × 6
#> date a b d c SCust
#> <chr> <int> <int> <int> <int> <int>
#> 1 jan 2 2 2 1 2
#> 2 feb 2 1 1 3 2
#> 3 SCust 1 0 0 1 NA
I don't know about the marginals, but for the main table:
library(reshape2)
dcast(
df,
date~product,
function(x){length(unique(x))},
value.var="customer"
)
date a b c d
1 feb 2 1 3 1
2 jan 2 2 1 2
You can try
library(tidyverse)
df %>%
pivot_wider(names_from = product, values_from = customer, values_fn = n_distinct) %>%
bind_rows(
df %>%
count(product, customer) %>%
group_by(product) %>%
summarise(n=sum(n-1),
date = "all") %>%
pivot_wider(names_from = product,values_from=n ))
# A tibble: 3 x 5
date a b d c
<chr> <dbl> <dbl> <dbl> <dbl>
1 jan 2 2 2 1
2 feb 2 1 1 3
3 all 1 0 0 1
dt <- data.frame(stringsAsFactors = FALSE,
date = c("jan","jan","jan","jan", "jan","jan","jan","feb","feb","feb","feb","feb","feb","feb"),
customer = c("john","john","john","Mary", "Mary","Mary","Mary","Robert","Robert","Mary","john","john","Robert","Robert"),
product = c("a","b","d","a","b","c","d","a","b","c","a","c","c","d")
)
library(data.table)
setDT(dt)
setorder(dt, product)
rbindlist(list(
dcast(dt[, .(value = .N), by = .(date, product)], date ~ product),
transpose(dt[, .(same_cust_row = .N - length(unique(customer))), by = .(product)], make.names = "product", keep.names = "date")
))
# date a b c d
# 1: feb 2 1 3 1
# 2: jan 2 2 1 2
# 3: same_cust_row 1 0 1 0
Do you need the "detail" data, or just the summary ("same cust") data?
library(dplyr)
library(tidyr)
library(purrr)
# by product: number of customers who bought it in both months
df %>% pivot_wider(names_from = product, values_from = date, values_fn = length) %>%
select(-customer) %>%
map( ~ sum(.x==2))
$a
[1] 1
$b
[1] 0
$d
[1] 0
$c
[1] 1
# by month: number of customers who bought all (4) products
df %>% pivot_wider(names_from = date, values_from = product, values_fn = length) %>%
select(-customer) %>%
map( ~ sum(.x==4))
$jan
[1] NA
$feb
[1] 1
I have a df of the form:
df <- tibble(
id = c(1,2,3),
val02 = c(0,1,0),
val03 = c(1,0,0),
val04 = c(0,1,1),
age02 = c(1,2,3),
age03 = c(2,3,4),
age04 = c(3,4,5)
)
I want to bring it into tidy format like:
# A tibble: 9 x 4
id year val age
<dbl> <chr> <dbl> <dbl>
1 1 02 0 1
2 1 03 1 2
3 1 04 0 3
4 2 02 1 2
5 2 03 0 3
6 2 04 1 4
7 3 02 0 3
8 3 03 0 4
9 3 04 1 5
Using two separate pivot_longer calls with a left_join at the end, I achieved what I want:
library(tidyverse)
df1 <- df %>%
pivot_longer(cols = starts_with("val"), names_to = "year", values_to = "val", names_prefix = "val")
df2 <- df %>%
pivot_longer(cols = starts_with("age"), names_to = "year", values_to = "age", names_prefix = "age")
left_join(df1, df2) %>%
select(id, year, val, age)
This, however, seems utterly complicated.
How can I simplify this operation? Is there a way to perform this operation in one go? (in one pipe..?)
This depends on the complexity of your strings (column names), but to give an idea:
library(tidyverse)
df %>%
pivot_longer(-id,
names_to = c('.value', 'year'),
names_pattern = '([a-z]+)(\\d+)'
)
Output:
# A tibble: 9 x 4
id year val age
<dbl> <chr> <dbl> <dbl>
1 1 02 0 1
2 1 03 1 2
3 1 04 0 3
4 2 02 1 2
5 2 03 0 3
6 2 04 1 4
7 3 02 0 3
8 3 03 0 4
9 3 04 1 5
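To check what the two capture groups in names_pattern pick up, the same regex can be run over the column names in base R. This is only a sketch of the matching step, not of what pivot_longer does internally; `cols`, `parts` and `m` are names made up for the example.

```r
cols <- c("val02", "val03", "val04", "age02", "age03", "age04")

# regexec() returns the full match plus one entry per capture group;
# regmatches() extracts the matched substrings
parts <- regmatches(cols, regexec("([a-z]+)(\\d+)", cols))

# One row per column name: full match, group 1 (".value"), group 2 ("year")
m <- do.call(rbind, parts)
m
```

Group 1 supplies the output column name (val or age) and group 2 the year, which is why a single pivot_longer call can replace the two pivots and the join.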
Sorry if this post is not well organized; this is my first time on Stack Overflow.
I am trying to create a column that gives the order within each ID, but with a twist: if there is a gap year, the order needs to restart from the beginning.
Please check the example and expected result below.
I wasn't able to find appropriate code for it and cannot think of anything. Please help me! I'd appreciate it a lot!
One option is to create a new group variable when difference between the year is greater than 1 and create a sequence in each group using row_number().
library(dplyr)
df %>%
group_by(ID, group = cumsum(c(1, diff(Year) > 1))) %>%
mutate(order = row_number()) %>%
ungroup() %>%
select(-group)
# ID Year order
# <fct> <int> <int>
# 1 A 2007 1
# 2 A 2008 2
# 3 A 2009 3
# 4 A 2013 1
# 5 A 2014 2
# 6 A 2015 3
# 7 A 2016 4
# 8 B 2010 1
# 9 B 2012 1
#10 B 2013 2
Using base R ave that would be
as.integer(with(df, ave(ID, ID, cumsum(c(1, diff(Year) > 1)), FUN = seq_along)))
#[1] 1 2 3 1 2 3 4 1 1 2
data
df <- data.frame(ID = c(rep("A", 7), rep("B", 3)),
Year = c(2007:2009, 2013:2016, 2010, 2012, 2013), stringsAsFactors = FALSE)
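The intermediate steps of the cumsum trick may be easier to follow on a single ID. A minimal sketch using the years of ID "A" from the data above; `gap`, `run` and `ord` are illustrative names:

```r
Year <- c(2007, 2008, 2009, 2013, 2014, 2015, 2016)

# TRUE where the step from the previous year is larger than 1 (a gap);
# the leading 1 gives the first row its own starting run
gap <- c(1, diff(Year) > 1)

# Each gap bumps the running total, so rows between gaps share one run id
run <- cumsum(gap)

# Number the rows within each run
ord <- ave(seq_along(Year), run, FUN = seq_along)

data.frame(Year, gap, run, ord)
```

With several IDs, the run id is combined with ID in the grouping, as in the answers above, so a new ID always starts a new sequence.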
A data.table option:
library(data.table)
setDT(df)
df[, jump := Year - shift(Year) - 1, by = ID
][is.na(jump), jump := 0
][, order := seq_len(.N), by = .(ID, cumsum(jump))]
# ID Year jump order
# 1: A 2007 0 1
# 2: A 2008 0 2
# 3: A 2009 0 3
# 4: A 2013 3 1
# 5: A 2014 0 2
# 6: A 2015 0 3
# 7: A 2016 0 4
# 8: B 2010 0 1
# 9: B 2012 1 1
# 10: B 2013 0 2
Or using data.table::nafill(), available since data.table v1.12.4:
df[, jump := nafill(Year - shift(Year) - 1, fill = 0), by = ID
][, order := seq_len(.N), by = .(ID, cumsum(jump))]
We can take the difference of 'Year' and the lag of 'Year', get the cumulative sum, use that in the group_by along with 'ID' and create the order as row_number()
library(dplyr)
df %>%
group_by(ID, grp = cumsum(Year - lag(Year, default = Year[1]) > 1)) %>%
mutate(order = row_number()) %>%
ungroup %>%
select(-grp)
# A tibble: 10 x 3
# ID Year order
# <chr> <dbl> <int>
# 1 A 2007 1
# 2 A 2008 2
# 3 A 2009 3
# 4 A 2013 1
# 5 A 2014 2
# 6 A 2015 3
# 7 A 2016 4
# 8 B 2010 1
# 9 B 2012 1
#10 B 2013 2
data
df <- structure(list(ID = c("A", "A", "A", "A", "A", "A", "A", "B",
"B", "B"), Year = c(2007, 2008, 2009, 2013, 2014, 2015, 2016,
2010, 2012, 2013)), class = "data.frame", row.names = c(NA, -10L
))