I have a dataframe that looks like
id = c("1", "2", "3")
IN1999 = c(1, 1, 0)
IN2000 = c(1, 0, 1)
TEST1999 = c(10, 12, NA)
TEST2000 = c(15, NA, 11)
df <- data.frame(id, IN1999, IN2000, TEST1999, TEST2000)
I am trying to use pivot_longer to change it into this form:
id year IN TEST
1 1 1999 1 10
2 1 2000 1 15
3 2 1999 1 12
4 2 2000 0 NA
5 3 1999 0 NA
6 3 2000 1 11
My current code looks like this
df %>%
pivot_longer(col = !id, names_to = c(".value", "year"),
names_sep = 4)
but obviously, with names_sep = 4, R cuts IN1999 and IN2000 in the wrong place. How can I set the argument so that R separates the column name from the last four digits?
The names_sep argument of pivot_longer also accepts a regular expression, which lets you split just before a run of four digits, as in the example below:
library(tidyr)
df |>
  pivot_longer(cols = !id, names_to = c(".value", "year"),
               names_sep = "(?=\\d{4})")
Output:
# A tibble: 6 × 4
id year IN TEST
<chr> <chr> <dbl> <dbl>
1 1 1999 1 10
2 1 2000 1 15
3 2 1999 1 12
4 2 2000 0 NA
5 3 1999 0 NA
6 3 2000 1 11
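As an alternative sketch of the same split, names_pattern can capture the two pieces explicitly instead of describing the separator. The pattern below and the integer conversion of year are illustrative assumptions, not part of the original answer; the data frame is rebuilt from the question so the block runs on its own.

```r
library(tidyr)

# Data from the question
df <- data.frame(
  id = c("1", "2", "3"),
  IN1999 = c(1, 1, 0),
  IN2000 = c(1, 0, 1),
  TEST1999 = c(10, 12, NA),
  TEST2000 = c(15, NA, 11)
)

long <- pivot_longer(
  df,
  cols = !id,
  names_to = c(".value", "year"),
  # group 1: the non-digit prefix (IN, TEST); group 2: the four-digit year
  names_pattern = "(\\D+)(\\d{4})",
  # assumption: an integer year is more convenient than the default character
  names_transform = list(year = as.integer)
)

long
```

Capturing groups tend to be easier to read than a lookahead when the prefix and suffix have clearly different character classes, as they do here.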
I'm trying to calculate the cumulative time among several grades.
Here's what my original df looks like:
df = data.frame(id = c(1,1,1,1,2,2,2,2),
group = c(0,0,0,0,1,1,1,1),
grade = c(0,1,2,3,0,1,3,4),
time = c(10,7,4,1,20,17,14,11))
Here's what I'm expecting as the result df1:
df1 = df %>%
pivot_wider(
names_from = "grade",
names_prefix = "grade_",
values_from = "time") %>%
replace(is.na(.), 0) %>%
mutate(grade_1 = grade_1 + grade_2 + grade_3 + grade_4,
grade_2 = grade_2 + grade_3 + grade_4,
grade_3 = grade_3 + grade_4) %>%
pivot_longer(
cols = 3:7,
names_to = "grade",
names_prefix = "grade_",
values_to = "time")
My method works, but I want it to be more flexible: when there are more grades in the df, I don't want to have to add grade_x = grade_1 + grade_2 + grade_3 ... by hand.
Thank you!
One option is to sort each id by descending grade and then apply cumsum, so that the sum runs in reverse. The last row in that order (where grade == 0) is excluded and keeps its original time. Then we sort back into the desired order and ungroup.
library(tidyverse)
results <- df %>%
  group_by(id) %>%
  arrange(id, desc(grade)) %>%
  mutate(time = ifelse(row_number() != n(), cumsum(time), time)) %>%
  arrange(id, grade) %>%
  ungroup()
Output
id group grade time
<dbl> <dbl> <dbl> <dbl>
1 1 0 0 10
2 1 0 1 12
3 1 0 2 5
4 1 0 3 1
5 2 1 0 20
6 2 1 1 42
7 2 1 3 25
8 2 1 4 11
If you need each group to have the same number of rows as in your desired output, then you can use complete:
df %>%
tidyr::complete(id, grade) %>%
group_by(id) %>%
fill(group, .direction ="downup") %>%
replace(is.na(.), 0) %>%
arrange(id, desc(grade)) %>%
mutate(time = ifelse(row_number()!=n(), cumsum(time), time)) %>%
arrange(id, grade) %>%
ungroup
Output
id grade group time
<dbl> <dbl> <dbl> <dbl>
1 1 0 0 10
2 1 1 0 12
3 1 2 0 5
4 1 3 0 1
5 1 4 0 0
6 2 0 1 20
7 2 1 1 42
8 2 2 1 25
9 2 3 1 25
10 2 4 1 11
Or if you want to pivot back and forth then you could do something like this:
output <- df %>%
pivot_wider(
names_from = "grade",
names_prefix = "grade_",
values_from = "time") %>%
replace(is.na(.), 0) %>%
select(id, group, grade_0, last_col():grade_1)
results2 <- output %>%
select(-c(id, group, grade_0)) %>%
rowwise()%>%
do(data.frame(t(cumsum(unlist(.))))) %>%
bind_cols(select(output, id, group, grade_0), .) %>%
pivot_longer(
cols = 3:7,
names_to = "grade",
names_prefix = "grade_",
values_to = "time")
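The reverse cumulative sum described above can also be sketched without re-sorting twice, by wrapping cumsum in a small helper. The helper name rev_cumsum is hypothetical (it is not a dplyr function), and the sketch assumes, as in the question, that grade 0 keeps its own time and is left out of the running sum.

```r
library(dplyr)

# Data from the question
df <- data.frame(id    = c(1, 1, 1, 1, 2, 2, 2, 2),
                 group = c(0, 0, 0, 0, 1, 1, 1, 1),
                 grade = c(0, 1, 2, 3, 0, 1, 3, 4),
                 time  = c(10, 7, 4, 1, 20, 17, 14, 11))

# Hypothetical helper: cumulative sum taken from the last element backwards
rev_cumsum <- function(x) rev(cumsum(rev(x)))

result <- df %>%
  arrange(id, grade) %>%
  group_by(id) %>%
  mutate(
    # grade 0 contributes nothing to the running sum and keeps its own time
    time = if_else(grade == 0, time,
                   rev_cumsum(if_else(grade == 0, 0, time)))
  ) %>%
  ungroup()

result
```

Because nothing refers to a specific grade_x column, the same code works for any number of grades.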
First try:
For cumulative sums within a variable, we can use group_by() and cumsum().
There is no need to list the grades by hand, and you can add further aggregations if needed.
df%>%
group_by(grade)%>%
mutate(Cum_Time = cumsum(time))%>%arrange(grade)
id group grade time Cum_Time
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 0 10 10
2 2 1 0 20 30
3 1 0 1 7 7
4 2 1 1 17 24
5 1 0 2 4 4
6 1 0 3 1 1
7 2 1 3 14 15
8 2 1 4 11 11
I have the following table.
Date        Cat
15/2/1999   A
15/2/1999   A
15/2/1999   B
15/5/1999   A
15/5/1999   B
15/10/1999  C
15/10/1999  C
15/2/2001   A
15/2/2001   A
15/6/2001   B
15/6/2001   B
15/6/2001   C
15/11/2001  C
15/11/2001  C
I would like to apply pivot_wider (or any similar function) to it while also splitting the date into Month and Year columns, as seen below. The Cat column is spread into the columns A, B and C, and the count of each is displayed.
Month     Year  A  B  C  Total
February  1999  2  1  0  3
May       1999  1  1  0  2
October   1999  0  0  2  2
February  2001  2  0  0  2
June      2001  0  2  1  3
November  2001  0  0  2  2
Does anyone here know how I can do both together? Thanks
You can do this with tidyverse packages. First, format your date column as date, then count by month, pivot to wider and format the table.
library(tidyverse)
library(lubridate) # year() and month() are called unqualified below
data %>%
mutate(Date = as.Date(Date, format = "%d/%m/%Y")) %>%
group_by(Cat, month = lubridate::floor_date(Date, "month")) %>%
count(Cat) %>%
pivot_wider(names_from = Cat, values_from = n, values_fill = 0) %>%
mutate(year = year(month), .before = "A",
month = month(month, label = TRUE, abbr = FALSE)) %>%
mutate(Total = rowSums(across(A:C))) %>%
arrange(year)
month year A B C Total
<ord> <dbl> <int> <int> <int> <dbl>
1 February 1999 2 1 0 3
2 May 1999 1 1 0 2
3 October 1999 0 0 2 2
4 February 2001 2 0 0 2
5 June 2001 0 2 1 3
6 November 2001 0 0 2 2
data
data <- structure(list(Date = c("15/2/1999", "15/2/1999", "15/2/1999",
"15/5/1999", "15/5/1999", "15/10/1999", "15/10/1999", "15/2/2001",
"15/2/2001", "15/6/2001", "15/6/2001", "15/6/2001", "15/11/2001",
"15/11/2001"), Cat = c("A", "A", "B", "A", "B", "C", "C", "A",
"A", "B", "B", "C", "C", "C")), class = "data.frame", row.names = c(NA,
-14L))
Another possible solution:
library(tidyverse)
library(lubridate)
df <- data.frame(
stringsAsFactors = FALSE,
Date = c("15/2/1999",
"15/2/1999","15/2/1999","15/5/1999","15/5/1999",
"15/10/1999","15/10/1999","15/2/2001","15/2/2001",
"15/6/2001","15/6/2001","15/6/2001","15/11/2001",
"15/11/2001"),
Cat = c("A","A","B","A",
"B","C","C","A","A","B","B","C","C","C")
)
df %>%
mutate(Month = month(dmy(Date), label = TRUE), Year = year(dmy(Date))) %>%
pivot_wider(id_cols = c(Month, Year), names_from = Cat,
values_from = Cat, values_fn = length, values_fill = 0) %>%
mutate(Total = rowSums(.[3:5]))
#> # A tibble: 6 × 6
#> Month Year A B C Total
#> <ord> <dbl> <int> <int> <int> <dbl>
#> 1 Feb 1999 2 1 0 3
#> 2 May 1999 1 1 0 2
#> 3 Oct 1999 0 0 2 2
#> 4 Feb 2001 2 0 0 2
#> 5 Jun 2001 0 2 1 3
#> 6 Nov 2001 0 0 2 2
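Both answers follow the same three steps: parse the date, count rows per month and category, then spread the categories into columns. A minimal sketch of those steps using count() is shown below. The intermediate object name wide is an assumption for illustration, and the month labels depend on the session locale.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Data from the question
data <- data.frame(
  Date = c("15/2/1999", "15/2/1999", "15/2/1999", "15/5/1999", "15/5/1999",
           "15/10/1999", "15/10/1999", "15/2/2001", "15/2/2001",
           "15/6/2001", "15/6/2001", "15/6/2001", "15/11/2001", "15/11/2001"),
  Cat = c("A", "A", "B", "A", "B", "C", "C", "A", "A", "B", "B", "C", "C", "C")
)

wide <- data %>%
  # parse day/month/year strings once, then derive Year and Month from the Date
  mutate(Date  = dmy(Date),
         Year  = year(Date),
         Month = month(Date, label = TRUE, abbr = FALSE)) %>%
  # one row per Year/Month/Cat combination with its count in n
  count(Year, Month, Cat) %>%
  # categories become columns; missing combinations are filled with 0
  pivot_wider(names_from = Cat, values_from = n, values_fill = 0) %>%
  mutate(Total = A + B + C) %>%
  select(Month, Year, A, B, C, Total)

wide
```

Parsing the date explicitly with dmy() before calling month() and year() avoids relying on how a character string happens to be coerced.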
I have a data frame with sales per customer and month.
df <-
data.frame(
stringsAsFactors = FALSE,
date = c("jan","jan","jan","jan",
"jan","jan","jan","feb","feb","feb","feb","feb",
"feb","feb"),
customer = c("john","john","john","Mary",
"Mary","Mary","Mary","Robert","Robert","Mary",
"john","john","Robert","Robert"),
product = c("a","b","d","a","b","c",
"d","a","b","c","a","c","c","d"))
date customer product
1 jan john a
2 jan john b
3 jan john d
4 jan Mary a
5 jan Mary b
6 jan Mary c
7 jan Mary d
8 feb Robert a
9 feb Robert b
10 feb Mary c
11 feb john a
12 feb john c
13 feb Robert c
14 feb Robert d
I need to summarize how many times the same customer is present across months and products.
Expected result:
date a b c d same cust
jan 2 2 1 2 0
feb 2 1 2 0 1
same cust 1 0 1 0
A possible solution:
library(tidyverse)
df <-
data.frame(
stringsAsFactors = FALSE,
date = c("jan","jan","jan","jan",
"jan","jan","jan","feb","feb","feb","feb","feb",
"feb","feb"),
customer = c("john","john","john","Mary",
"Mary","Mary","Mary","Robert","Robert","Mary",
"john","john","Robert","Robert"),
product = c("a","b","d","a","b","c",
"d","a","b","c","a","c","c","d"))
df %>%
pivot_wider(id_cols = date, names_from = product, values_from = customer, values_fn = length) %>%
bind_cols(SCust = table(df$customer, df$date) %>% apply(2, \(x) sum(x>=2))) %>%
bind_rows(c(tibble(date="SCust"),
table(df$customer, df$product) %>% apply(2, \(x) sum(x>=2))))
#> # A tibble: 3 × 6
#> date a b d c SCust
#> <chr> <int> <int> <int> <int> <int>
#> 1 jan 2 2 2 1 2
#> 2 feb 2 1 1 3 2
#> 3 SCust 1 0 0 1 NA
I don't know about the marginals, but for the main table:
library(reshape2)
dcast(
df,
date~product,
function(x){length(unique(x))},
value.var="customer"
)
date a b c d
1 feb 2 1 3 1
2 jan 2 2 1 2
You can try
library(tidyverse)
df %>%
pivot_wider(names_from = product, values_from = customer, values_fn = n_distinct) %>%
bind_rows(
df %>%
count(product, customer) %>%
group_by(product) %>%
summarise(n=sum(n-1),
date = "all") %>%
pivot_wider(names_from = product,values_from=n ))
# A tibble: 3 x 5
date a b d c
<chr> <dbl> <dbl> <dbl> <dbl>
1 jan 2 2 2 1
2 feb 2 1 1 3
3 all 1 0 0 1
dt <- data.frame(stringsAsFactors = FALSE,
date = c("jan","jan","jan","jan", "jan","jan","jan","feb","feb","feb","feb","feb","feb","feb"),
customer = c("john","john","john","Mary", "Mary","Mary","Mary","Robert","Robert","Mary","john","john","Robert","Robert"),
product = c("a","b","d","a","b","c","d","a","b","c","a","c","c","d")
)
library(data.table)
setDT(dt)
setorder(dt, product)
rbindlist(list(
dcast(dt[, .(value = .N), by = .(date, product)], date ~ product),
transpose(dt[, .(same_cust_row = .N - length(unique(customer))), by = .(product)], make.names = "product", keep.names = "date")
))
# date a b c d
# 1: feb 2 1 3 1
# 2: jan 2 2 1 2
# 3: same_cust_row 1 0 1 0
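The same_cust_row line above counts repeat purchases per product as the number of rows minus the number of distinct customers. A minimal dplyr sketch of that one calculation follows. It assumes this "rows minus distinct customers" reading of "same cust" is what is wanted, and the object name per_product is illustrative only.

```r
library(dplyr)

# Data from the question
df <- data.frame(
  date = c("jan", "jan", "jan", "jan", "jan", "jan", "jan",
           "feb", "feb", "feb", "feb", "feb", "feb", "feb"),
  customer = c("john", "john", "john", "Mary", "Mary", "Mary", "Mary",
               "Robert", "Robert", "Mary", "john", "john", "Robert", "Robert"),
  product = c("a", "b", "d", "a", "b", "c", "d",
              "a", "b", "c", "a", "c", "c", "d")
)

per_product <- df %>%
  group_by(product) %>%
  # purchases of the product minus distinct buyers = repeat purchases
  summarise(same_cust = n() - n_distinct(customer))

per_product
```

The result could then be reshaped with pivot_wider and appended to the main table with bind_rows, as the other answers do.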
Do you need the "detail" data, or just the summary ("same cust") data?
library(dplyr)
library(tidyr)
library(purrr)
# by month / same customer bought in both months
df %>% pivot_wider(names_from = product, values_from = date, values_fn = length) %>%
select(-customer) %>%
map( ~ sum(.x==2))
$a
[1] 1
$b
[1] 0
$d
[1] 0
$c
[1] 1
# by month / same customer bought all (4) products
df %>% pivot_wider(names_from = date, values_from = product, values_fn = length) %>%
select(-customer) %>%
map( ~ sum(.x==4))
$jan
[1] NA
$feb
[1] 1
I have a df of the form:
df <- tibble(
id = c(1,2,3),
x02val_a = c(0,1,0),
x03val_a = c(1,0,0),
x04val_a = c(0,1,1),
x02val_b = c(0,2,0),
x03val_b = c(1,3,0),
x04val_b = c(0,1,2),
age02 = c(1,2,3),
age03 = c(2,3,4),
age04 = c(3,4,5)
)
I want to bring it into tidy format like:
# A tibble: 9 x 5
id year val_a val_b age
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 02 0 0 1
2 1 03 1 1 2
...
The answer from here worked for simpler naming schemes. With the naming scheme present in my real dataset, however, I struggle to define a regex that matches all patterns.
My attempts so far each missed one of the schemes. I can capture the one with the variable name first and the year last (age02), or the one with the type and year first and the name last (x02val_a), but not both at the same time.
Is there a way to do this with a) a regex? or b) some combinations or parameterizations of the pivot_longer call(s)?
I know there is always the possibility to do it with a left join at the end as I described here
I tried to define the regex with one group nested inside another (since the groups are not strictly serial, i.e. left then right), which led me to:
df %>%
pivot_longer(-id,names_to = c('.value', 'year'),names_pattern = '([a-z]+(\\d+)[a-z]+_[a-z])')
Let's try. It seems this name pattern works:
df %>%
pivot_longer(-id,
names_to = c('.value', 'year','.value'),
names_pattern = '([a-z]+)(\\d+)([a-z_]*)')
# A tibble: 9 x 5
id year xval_a xval_b age
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 02 0 0 1
2 1 03 1 1 2
3 1 04 0 0 3
4 2 02 1 2 2
5 2 03 0 3 3
6 2 04 1 1 4
7 3 02 0 0 3
8 3 03 0 0 4
9 3 04 1 2 5
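To see why that pattern handles both naming schemes, it can help to inspect what each capture group picks up. The sketch below uses stringr::str_match on a few of the column names. Pasting groups 1 and 3 together mirrors the column names in the output above (xval_a, xval_b, age), though the exact internals of pivot_longer are not shown here.

```r
library(stringr)

# A few column names from the question
nms <- c("x02val_a", "x03val_b", "age04")

# Column 1 is the full match; columns 2 to 4 are the three capture groups
parts <- str_match(nms, "([a-z]+)(\\d+)([a-z_]*)")

prefix <- parts[, 2]  # letters before the digits: "x" or "age"
year   <- parts[, 3]  # the digits
suffix <- parts[, 4]  # letters/underscores after the digits, possibly empty

# The two ".value" pieces combined, as in the output column names
value_name <- paste0(prefix, suffix)
value_name
```

For age04 the third group matches the empty string, so the combined name is just age, while the x-prefixed columns keep their leading x.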
It's a bit roundabout, but because of the inconsistent name style, you might first rename your columns to match an easier pattern. There are 3 possible pieces of information in your names, but (at least in your example) each column has only 2 of these.
The relevant pieces are:
A run of consecutive characters matching "[a-z_]", which occurs either before the 2 digits (after an optional leading "x") or after them. Whichever of these is present gets moved to the beginning of the name; whichever is absent matches the empty string and takes up no space.
2 digits, which get moved to the end.
With the names in this cleaner form, pivot_longer's ".value" option gives you the column names in a single step. It should be easy to adjust the pattern as needed, e.g. to fit a different number of digits.
library(dplyr)
library(tidyr)
df %>%
rename_all(stringr::str_replace, "x?([a-z_]*)(\\d{2})([a-z_]*)", "\\1\\3\\2") %>%
pivot_longer(-id, names_to = c(".value", "year"), names_pattern = "([a-z_]+)(\\d{2})")
#> # A tibble: 9 x 5
#> id year val_a val_b age
#> <dbl> <chr> <dbl> <dbl> <dbl>
#> 1 1 02 0 0 1
#> 2 1 03 1 1 2
#> 3 1 04 0 0 3
#> 4 2 02 1 2 2