Organize columns with numerical info in colnames via dplyr relocate - r

I have a large amount of annual data in a data frame that will only get larger. I would like to organize it, grouping columns according to the year, which is included in the column names.
Base data:
dput(dat)
structure(list(id = 1:2, quantity = 3:4, avg_2002 = 5:6, avg_2003 = 7:8,
avg_2020 = 9:10, rev_2002 = c(15L, 24L), rev_2003 = c(21L,
32L), rev_2020 = c(27L, 40L)), row.names = c(NA, -2L), class = "data.frame")
What I would like to do is have all of the columns with, say, "2002" in them organized together, followed by the "2003" columns and so on...I know that relocate in dplyr is a good way to do it so I did the following:
dat <- tibble(dat)
dat <- dat %>%
relocate(grep("2002$", colnames(dat), value = TRUE),
.before = grep("2003$", colnames(dat), value = TRUE)) %>%
relocate(grep("2003$", colnames(dat), value = TRUE),
.after = grep("2002$", colnames(dat), value = TRUE))
which produces the desired result for my toy dataset:
id quantity avg_2002 rev_2002 avg_2003 rev_2003 avg_2020 rev_2020
<int> <int> <int> <int> <int> <int> <int> <int>
1 1 3 5 15 7 21 9 27
2 2 4 6 24 8 32 10 40
My question is this:
How do I generalize the code above so that I don't have to keep adding relocate statements ad nauseum?
Is there a better way to do this task without using dplyr::relocate?
Any suggestions are much appreciated. Thanks!

We may use select - extract the numeric part of the column names, order it and use that index in select to reorder
library(dplyr)
dat %>%
select(id, quantity, order(readr::parse_number(names(.)[-(1:2)])) + 2)
-output
# A tibble: 2 × 8
id quantity avg_2002 rev_2002 avg_2003 rev_2003 avg_2020 rev_2020
<int> <int> <int> <int> <int> <int> <int> <int>
1 1 3 5 15 7 21 9 27
2 2 4 6 24 8 32 10 40

Related

Rearrangement columns of a table in R

I have the following table that I want to modify
Debt2017 Debt2018 Debt2019 Cash2017 Cash2018 Cash2019 Year Other
2 4 3 5 6 7 2018 x
3 8 9 7 9 9 2017 y
So that the result is the following
Debt Cash FLAG After Other
2 5 0 x
3 7 1 x
8 9 1 y
9 9 1 y|
Basically, I want to change the data so that I have the different years in different rows, eliminating the values for the year indicated in the column "Year" and adding a FLAG that tells me whether the data indicated in the row is from a previous (0) or following (1) year (with respect to the year indicated in the column "Year").
Furthermore, I also want to keep the column "Other".
Does anybody know how to do it in R?
library(dplyr)
library(tidyr)
df %>%
pivot_longer(Debt2017:Cash2019,
names_to = c(".value", "Year2"),
names_pattern = "(\\D+)(\\d+)") %>%
filter(Year != Year2) %>%
mutate(flag = +(Year2 > Year))
# # A tibble: 4 × 6
# Year Other Year2 Debt Cash flag
# <int> <chr> <chr> <int> <int> <int>
# 1 2018 x 2017 2 5 0
# 2 2018 x 2019 3 7 1
# 3 2017 y 2018 8 9 1
# 4 2017 y 2019 9 9 1
Data
df <- structure(list(Debt2017 = 2:3, Debt2018 = c(4L, 8L), Debt2019 = c(3L, 9L),
Cash2017 = c(5L, 7L), Cash2018 = c(6L, 9L), Cash2019 = c(7L, 9L),
Year = 2018:2017, Other = c("x", "y")), class = "data.frame", row.names = c(NA, -2L))

Extract data based on time to death

Hi I'm analysing the pattern of spending for individuals before they died. My dataset contains individuals' monthly spending and their dates of death. The dataset looks similar to this:
ID 2018_11 2018_12 2019_01 2019_02 2019_03 2019_04 2019_05 2019_06 2019_07 2019_08 2019_09 2019_10 2019_11 2019_12 2020_01 date_of_death
A 15 14 6 23 23 5 6 30 1 15 6 7 8 30 1 2020-01-02
B 2 5 6 7 7 8 9 15 12 14 31 30 31 0 0 2019-11-15
Each column denotes the month of the year. For example, "2018_11" means November 2018. The number in each cell denotes the spending in that specific month.
I would like to construct a data frame which contains the spending data of each individual in their last 0-12 months. It will look like this:
ID last_12_month last_11_month ...... last_1_month last_0_month date_of_death
A 6 23 30 1 2020-01-02
B 2 5 30 31 2019-11-15
Each individual died at different time. For example, individual A died on 2020-01-02, so the data of the "last_0_month" for this person should be extracted from the column "2020_01", and that of "last_12_month" extracted from "2019_01"; individual B died on 2019-11-15, so the data of "last_0_month" for this person should be extracted from the column "2019_11", and that of "last_12_month" should be extracted from the column "2018_11".
I will be really grateful for your help.
Using data.table and lubridate packages
library(data.table)
library(lubridate)
setDT(dt)
dt <- melt(dt, id.vars = c("ID", "date_of_death"))
dt[, since_death := interval(ym(variable), ymd(date_of_death)) %/% months(1)]
dt <- dcast(dt[since_death %between% c(0, 12)], ID + date_of_death ~ since_death, value.var = "value", fun.aggregate = sum)
setcolorder(dt, c("ID", "date_of_death", rev(names(dt)[3:15])))
setnames(dt, old = names(dt)[3:15], new = paste("last", names(dt)[3:15], "month", sep = "_"))
Results
dt
# ID date_of_death last_12_month last_11_month last_10_month last_9_month last_8_month last_7_month last_6_month last_5_month last_4_month last_3_month
# 1: A 2020-01-02 6 23 23 5 6 30 1 15 6 7
# 2: B 2019-11-15 2 5 6 7 7 8 9 15 12 14
# last_2_month last_1_month last_0_month
# 1: 8 30 1
# 2: 31 30 31
Data
dt <- structure(list(ID = c("A", "B"), `2018_11` = c(15L, 2L), `2018_12` = c(14L,
5L), `2019_01` = c(6L, 6L), `2019_02` = c(23L, 7L), `2019_03` = c(23L,
7L), `2019_04` = c(5L, 8L), `2019_05` = c(6L, 9L), `2019_06` = c(30L,
15L), `2019_07` = c(1L, 12L), `2019_08` = 15:14, `2019_09` = c(6L,
31L), `2019_10` = c(7L, 30L), `2019_11` = c(8L, 31L), `2019_12` = c(30L,
0L), `2020_01` = 1:0, date_of_death = structure(c(18263L, 18215L
), class = c("IDate", "Date"))), row.names = c(NA, -2L), class = c("data.frame"))
here you can find a similar approach to the one presented by #RuiBarradas but using lubridate for extracting the difference in months:
library(dplyr)
library(tidyr)
library(lubridate)
# Initial data
df <- structure(list(
ID = c("A", "B"),
`2018_11` = c(15, 2),
`2018_12` = c(14, 5),
`2019_01` = c(6, 6),
`2019_02` = c(23, 7),
`2019_03` = c(23, 7),
`2019_04` = c(5, 8),
`2019_05` = c(6, 9),
`2019_06` = c(30, 15),
`2019_07` = c(1, 12),
`2019_08` = c(15, 14),
`2019_09` = c(6, 31),
`2019_10` = c(7, 30),
`2019_11` = c(8, 31),
`2019_12` = c(30, 0),
`2020_01` = c(1, 0),
date_of_death = c("2020-01-02", "2019-11-15")
),
row.names = c(NA, -2L),
class = "data.frame"
)
# Convert to longer all cols that start with 20 (e.g. 2020, 2021)
df_long <- df %>%
pivot_longer(starts_with("20"), names_to = "month")
# treatment
df_long <- df_long %>%
mutate(
# To date, just in case
date_of_death = as.Date(date_of_death),
# Need to reformat the colnames from (e.g.) 2021_01 to 2021-01-01
month_fmt = as.Date(paste0(gsub("_", "-", df_long$month), "-01")),
# End of month
month_fmt = ceiling_date(month_fmt, "month") - days(1),
# End of month for month of death
date_of_death_eom = ceiling_date(date_of_death, "month") - days(1),
# Difference in months (using end of months
month_diff = round(time_length(
interval(month_fmt, date_of_death_eom),"month"),0)) %>%
# Select only months bw 0 and 12
filter(month_diff %in% 0:12) %>%
# Create labels for the next step
mutate(labs = paste0("last_", month_diff,"_month"))
# To wider
end <- df_long %>%
pivot_wider(
id_cols = c(ID, date_of_death),
names_from = labs,
values_from = value
)
end
#> # A tibble: 2 x 15
#> ID date_of_death last_12_month last_11_month last_10_month last_9_month
#> <chr> <date> <dbl> <dbl> <dbl> <dbl>
#> 1 A 2020-01-02 6 23 23 5
#> 2 B 2019-11-15 2 5 6 7
#> # ... with 9 more variables: last_8_month <dbl>, last_7_month <dbl>,
#> # last_6_month <dbl>, last_5_month <dbl>, last_4_month <dbl>,
#> # last_3_month <dbl>, last_2_month <dbl>, last_1_month <dbl>,
#> # last_0_month <dbl>
Created on 2022-03-09 by the reprex package (v2.0.1)
Here is a tidyverse solution.
Reshape the data to long format, coerce the date columns to class "Date", use Dirk Eddelbuettel's accepted answer to this question to compute the date differences in months and keep the rows with month differences between 0 and 12.
This grouped long format is probably more useful and I compute means by group and plot the spending of the last 12 months prior to death but since the question asks for a wide format, the output data set spending12_wide is created.
options(width=205)
df1 <- read.table(text = "
ID 2018_11 2018_12 2019_01 2019_02 2019_03 2019_04 2019_05 2019_06 2019_07 2019_08 2019_09 2019_10 2019_11 2019_12 2020_01 date_of_death
A 15 14 6 23 23 5 6 30 1 15 6 7 8 30 1 2020-01-02
B 2 5 6 7 7 8 9 15 12 14 31 30 31 0 0 2019-11-15
", header = TRUE, check.names = FALSE)
suppressPackageStartupMessages(library(dplyr))
library(tidyr)
library(ggplot2)
# Dirk's functions
monnb <- function(d) {
lt <- as.POSIXlt(as.Date(d, origin = "1900-01-01"))
lt$year*12 + lt$mon
}
# compute a month difference as a difference between two monnb's
diffmon <- function(d1, d2) { monnb(d2) - monnb(d1) }
spending12 <- df1 %>%
pivot_longer(cols = starts_with('20'), names_to = "month") %>%
mutate(month = as.Date(paste0(month, "_01"), "%Y_%m_%d"),
date_of_death = as.Date(date_of_death)) %>%
group_by(ID, date_of_death) %>%
mutate(diffm = diffmon(month, date_of_death)) %>%
filter(diffm >= 0 & diffm <= 12)
spending12 %>% summarise(spending = mean(value), .groups = "drop")
#> # A tibble: 2 x 3
#> ID date_of_death spending
#> <chr> <date> <dbl>
#> 1 A 2020-01-02 12.4
#> 2 B 2019-11-15 13.6
spending12_wide <- spending12 %>%
mutate(month = zoo::as.yearmon(month)) %>%
pivot_wider(
id_cols = c(ID, date_of_death),
names_from = diffm,
names_glue = "last_{.name}_month",
values_from = value
)
spending12_wide
#> # A tibble: 2 x 15
#> # Groups: ID, date_of_death [2]
#> ID date_of_death last_12_month last_11_month last_10_month last_9_month last_8_month last_7_month last_6_month last_5_month last_4_month last_3_month last_2_month last_1_month last_0_month
#> <chr> <date> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 A 2020-01-02 6 23 23 5 6 30 1 15 6 7 8 30 1
#> 2 B 2019-11-15 2 5 6 7 7 8 9 15 12 14 31 30 31
ggplot(spending12, aes(month, value, color = ID)) +
geom_line() +
geom_point()
Created on 2022-03-09 by the reprex package (v2.0.1)

Duplicate rows except for certain values in R. Want to turn those values into their own columns

I have a dataset which contains duplicate rows. Except these rows have 5 columns which don't hold the same value which prevents me from using distinct(). I was wondering if there was a way I could turn
Column1 Column2 Column3 Column4 Column5 Column6 Column7 Column8
1 2 3 7 9 10 11 4
1 2 3 8 10 11 12 5
Into this
Column1 Column2 Column3 Column4_a Column4_b Column5_a Column5_b Column6_a Column6_b Column7_a Column7_b
1 2 3 7 8 9 10 10 11 11 12
This can be close to what you want. Reshape data to long and then group by the variable and omit the duplicated rows. After that you can create the names for the future columns and reshape to wide. Here the code using tidyverse functions:
library(tidyverse)
#Code
newdf <- df %>%
pivot_longer(everything()) %>%
arrange(name) %>%
group_by(name) %>%
filter(!duplicated(value)) %>%
mutate(name=paste0(name,'_',row_number())) %>%
pivot_wider(names_from = name,values_from=value)
Output:
# A tibble: 1 x 13
Column1_1 Column2_1 Column3_1 Column4_1 Column4_2 Column5_1 Column5_2 Column6_1 Column6_2 Column7_1
<int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 1 2 3 7 8 9 10 10 11 11
# ... with 3 more variables: Column7_2 <int>, Column8_1 <int>, Column8_2 <int>
Some data used:
#Data
df <- structure(list(Column1 = c(1L, 1L), Column2 = c(2L, 2L), Column3 = c(3L,
3L), Column4 = 7:8, Column5 = 9:10, Column6 = 10:11, Column7 = 11:12,
Column8 = 4:5), class = "data.frame", row.names = c(NA, -2L
))
We could do this with pivot_wider after creating a sequence column
library(dplyr)
library(tidyr)
df %>%
mutate(rn = 1, rn2 = row_number()) %>%
pivot_wider(names_from = rn2, values_from = starts_with('Column'))

R function to paste information from different rows with a common column? [duplicate]

This question already has an answer here:
dplyr::first() to choose first non NA value
(1 answer)
Closed 2 years ago.
I understand we can use the dplyr function coalesce() to unite different columns, but is there such function to unite rows?
I am struggling with a confusing incomplete/doubled dataframe with duplicate rows for the same id, but with different columns filled. E.g.
id sex age source
12 M NA 1
12 NA 3 1
13 NA 2 2
13 NA NA NA
13 F 2 NA
and I am trying to achieve:
id sex age source
12 M 3 1
13 F 2 2
You can try:
library(dplyr)
#Data
df <- structure(list(id = c(12L, 12L, 13L, 13L, 13L), sex = structure(c(2L,
NA, NA, NA, 1L), .Label = c("F", "M"), class = "factor"), age = c(NA,
3L, 2L, NA, 2L), source = c(1L, 1L, 2L, NA, NA)), class = "data.frame", row.names = c(NA,
-5L))
df %>%
group_by(id) %>%
fill(everything(), .direction = "down") %>%
fill(everything(), .direction = "up") %>%
slice(1)
# A tibble: 2 x 4
# Groups: id [2]
id sex age source
<int> <fct> <int> <int>
1 12 M 3 1
2 13 F 2 2
As mentioned by #A5C1D2H2I1M1N2O1R2T1 you can select the first non-NA value in each group. This can be done using dplyr :
library(dplyr)
df %>% group_by(id) %>% summarise(across(.fns = ~na.omit(.)[1]))
# A tibble: 2 x 4
# id sex age source
# <int> <fct> <int> <int>
#1 12 M 3 1
#2 13 F 2 2
Base R :
aggregate(.~id, df, function(x) na.omit(x)[1], na.action = 'na.pass')
Or data.table :
library(data.table)
setDT(df)[, lapply(.SD, function(x) na.omit(x)[1]), id]

Add column based on other columns values

I am honest, I could come up with a decent title for this.
Basically, I have a dateframe:
ID Qty BasePrice Total
1 2 30 50
1 1 20 20
2 4 5 15
For each line I want to calculate the following:
Result = (Qty * BasePrice) - Total
Which is supposedly easy to do in R. However, I want to group the results by ID (sum them).
Sample Output:
ID Qty BasePrice Total Results
1 2 30 50 10
1 1 20 20 10
2 4 5 15 5
For instance, for ID=1, the values represent ((2*30)-50)+((1*20)-20)
Any idea on how can I achieve this?
Thanks!
We can do a group_by sum of the difference between the product of 'Qty', 'BasePrice' with 'Total'
library(dplyr)
df1 %>%
group_by(ID) %>%
mutate(Result = sum((Qty * BasePrice) - Total))
# A tibble: 3 x 5
# Groups: ID [2]
# ID Qty BasePrice Total Result
# <int> <int> <int> <int> <int>
#1 1 2 30 50 10
#2 1 1 20 20 10
#3 2 4 5 15 5
data
df1 <- structure(list(ID = c(1L, 1L, 2L), Qty = c(2L, 1L, 4L), BasePrice = c(30L,
20L, 5L), Total = c(50L, 20L, 15L)), class = "data.frame", row.names = c(NA,
-3L))

Resources