Retaining all columns in `tidyr::pivot_wider()` output - r

I am trying to convert data from long format to wide format using tidyr::pivot_wider(), but I am running into a problem.
Data
Let's say this is my example dataset
library(dplyr)
library(tidyr)
(dataEx <- structure(
list(
random1 = c(10, 10, 10, 10, 10, 10),
random2 = c(1, 1, 2, 2, 3, 3),
.rowid = c(1L, 1L, 2L, 2L, 3L, 3L),
Variable = c("x", "y", "x", "y", "x", "y"),
Dimension = c("Time", "Fraction", "Time", "Fraction", "Time", "Fraction"),
Unit = c("s", "%", "s", "%", "s", "%"),
Values = c(900, 25, 1800, 45, 3600, 78)
),
row.names = c(NA, -6L),
class = c("tbl_df", "tbl", "data.frame")
))
#> # A tibble: 6 x 7
#> random1 random2 .rowid Variable Dimension Unit Values
#> <dbl> <dbl> <int> <chr> <chr> <chr> <dbl>
#> 1 10 1 1 x Time s 900
#> 2 10 1 1 y Fraction % 25
#> 3 10 2 2 x Time s 1800
#> 4 10 2 2 y Fraction % 45
#> 5 10 3 3 x Time s 3600
#> 6 10 3 3 y Fraction % 78
Actual output
Here is what I currently use to pivot it to wide format. It works, but note that it drops two columns: random1 and random2.
dataEx %>%
tidyr::pivot_wider(
id_cols = .rowid,
names_from = Variable,
values_from = dplyr::matches("Values|Unit|Dimension"),
names_glue = "{Variable}{.value}"
)
#> # A tibble: 3 x 7
#> .rowid xDimension yDimension xUnit yUnit xValues yValues
#> <int> <chr> <chr> <chr> <chr> <dbl> <dbl>
#> 1 1 Time Fraction s % 900 25
#> 2 2 Time Fraction s % 1800 45
#> 3 3 Time Fraction s % 3600 78
Expected output
How can I prevent this, so that I get the following (expected) output?
#> # A tibble: 3 x 9
#> .rowid xDimension yDimension xUnit yUnit xValues yValues random1 random2
#> <int> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 1 Time Fraction s % 900 25 10 1
#> 2 2 Time Fraction s % 1800 45 10 2
#> 3 3 Time Fraction s % 3600 78 10 3

Add the columns you want to keep to the id_cols argument:
id_cols = c(.rowid, random1, random2)
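Applied to the example data from the question, the full call might look like the sketch below. The final relocate() step is an optional addition, not part of the answer above: id_cols columns come first in the result, so it moves random1 and random2 to the end to match the expected output.

```r
library(dplyr)
library(tidyr)

# Same data as in the question, rebuilt with tibble() for brevity
dataEx <- tibble(
  random1 = rep(10, 6),
  random2 = c(1, 1, 2, 2, 3, 3),
  .rowid = c(1L, 1L, 2L, 2L, 3L, 3L),
  Variable = rep(c("x", "y"), 3),
  Dimension = rep(c("Time", "Fraction"), 3),
  Unit = rep(c("s", "%"), 3),
  Values = c(900, 25, 1800, 45, 3600, 78)
)

wide <- dataEx %>%
  pivot_wider(
    # Every column listed here is kept as an identifier column
    id_cols = c(.rowid, random1, random2),
    names_from = Variable,
    values_from = c(Dimension, Unit, Values),
    names_glue = "{Variable}{.value}"
  ) %>%
  # Optional: move the extra identifier columns to the end
  relocate(random1, random2, .after = last_col())

wide
```

This only works cleanly when the extra columns are constant within each .rowid; if they vary, pivot_wider() will produce one row per distinct combination.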

Related

R Join without duplicates

Currently, when I join two datasets (from different years), rows from the second dataset are duplicated whenever it has fewer observations than the first.
Below, ID 1 has only one observation in year y, but it is repeated because the year x dataset has three observations for that ID. I don't want the duplicates, just NAs.
So what I currently get is this:
ID Value.x N.x Value.y N.y
<dbl> <chr> <dbl> <chr> <dbl>
1 1 A 6 A 2
2 1 B 7 A 2
3 1 C 1 A 2
What I want is:
ID Value.x N.x Value.y N.y
<dbl> <chr> <dbl> <chr> <dbl>
1 1 A 6 A 2
2 1 B 7 NA NA
3 1 C 1 NA NA
The end result is that my manager can tell in year x customer 1 ordered A, B, C in n.x quantities. In year y they only ordered A in n.y quantities.
Data:
structure(list(ID = c(1, 1, 1), Value = c("A", "B", "C"), N = c(6,
7, 1)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-3L))
structure(list(ID = 1, Value = "A", N = 2), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -1L))
I would do it like this:
merge(tbl_df1, tbl_df2, by = c("ID", "Value"), all = TRUE)
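As a runnable sketch of that call, assuming the two structures from the question are stored as tbl_df1 and tbl_df2 (the names used in the answer; plain data frames are used here for brevity):

```r
# Year x: three orders for customer 1
tbl_df1 <- data.frame(ID = c(1, 1, 1), Value = c("A", "B", "C"), N = c(6, 7, 1))
# Year y: a single order for customer 1
tbl_df2 <- data.frame(ID = 1, Value = "A", N = 2)

# Full outer join on both ID and Value: rows without a match get NA
joined <- merge(tbl_df1, tbl_df2, by = c("ID", "Value"), all = TRUE)
joined
```

Because Value is part of the join key, the result has a single Value column plus N.x and N.y, rather than separate Value.x and Value.y columns as in the layout requested in the question.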

Extract data based on time to death

Hi, I'm analysing the spending patterns of individuals before they died. My dataset contains individuals' monthly spending and their dates of death. The dataset looks similar to this:
ID 2018_11 2018_12 2019_01 2019_02 2019_03 2019_04 2019_05 2019_06 2019_07 2019_08 2019_09 2019_10 2019_11 2019_12 2020_01 date_of_death
A 15 14 6 23 23 5 6 30 1 15 6 7 8 30 1 2020-01-02
B 2 5 6 7 7 8 9 15 12 14 31 30 31 0 0 2019-11-15
Each column denotes the month of the year. For example, "2018_11" means November 2018. The number in each cell denotes the spending in that specific month.
I would like to construct a data frame which contains the spending data of each individual in their last 0-12 months. It will look like this:
ID last_12_month last_11_month ...... last_1_month last_0_month date_of_death
A 6 23 30 1 2020-01-02
B 2 5 30 31 2019-11-15
Each individual died at a different time. For example, individual A died on 2020-01-02, so the "last_0_month" value for this person should be taken from the column "2020_01", and the "last_12_month" value from "2019_01". Individual B died on 2019-11-15, so the "last_0_month" value for this person should be taken from the column "2019_11", and the "last_12_month" value from the column "2018_11".
I will be really grateful for your help.
Using the data.table and lubridate packages:
library(data.table)
library(lubridate)
setDT(dt)
dt <- melt(dt, id.vars = c("ID", "date_of_death"))
dt[, since_death := interval(ym(variable), ymd(date_of_death)) %/% months(1)]
dt <- dcast(dt[since_death %between% c(0, 12)], ID + date_of_death ~ since_death, value.var = "value", fun.aggregate = sum)
setcolorder(dt, c("ID", "date_of_death", rev(names(dt)[3:15])))
setnames(dt, old = names(dt)[3:15], new = paste("last", names(dt)[3:15], "month", sep = "_"))
Results
dt
# ID date_of_death last_12_month last_11_month last_10_month last_9_month last_8_month last_7_month last_6_month last_5_month last_4_month last_3_month
# 1: A 2020-01-02 6 23 23 5 6 30 1 15 6 7
# 2: B 2019-11-15 2 5 6 7 7 8 9 15 12 14
# last_2_month last_1_month last_0_month
# 1: 8 30 1
# 2: 31 30 31
Data
dt <- structure(list(ID = c("A", "B"), `2018_11` = c(15L, 2L), `2018_12` = c(14L,
5L), `2019_01` = c(6L, 6L), `2019_02` = c(23L, 7L), `2019_03` = c(23L,
7L), `2019_04` = c(5L, 8L), `2019_05` = c(6L, 9L), `2019_06` = c(30L,
15L), `2019_07` = c(1L, 12L), `2019_08` = 15:14, `2019_09` = c(6L,
31L), `2019_10` = c(7L, 30L), `2019_11` = c(8L, 31L), `2019_12` = c(30L,
0L), `2020_01` = 1:0, date_of_death = structure(c(18263L, 18215L
), class = c("IDate", "Date"))), row.names = c(NA, -2L), class = c("data.frame"))
Here is an approach similar to the one presented by @RuiBarradas, but using lubridate to compute the difference in months:
library(dplyr)
library(tidyr)
library(lubridate)
# Initial data
df <- structure(list(
ID = c("A", "B"),
`2018_11` = c(15, 2),
`2018_12` = c(14, 5),
`2019_01` = c(6, 6),
`2019_02` = c(23, 7),
`2019_03` = c(23, 7),
`2019_04` = c(5, 8),
`2019_05` = c(6, 9),
`2019_06` = c(30, 15),
`2019_07` = c(1, 12),
`2019_08` = c(15, 14),
`2019_09` = c(6, 31),
`2019_10` = c(7, 30),
`2019_11` = c(8, 31),
`2019_12` = c(30, 0),
`2020_01` = c(1, 0),
date_of_death = c("2020-01-02", "2019-11-15")
),
row.names = c(NA, -2L),
class = "data.frame"
)
# Convert to longer all cols that start with 20 (e.g. 2020, 2021)
df_long <- df %>%
pivot_longer(starts_with("20"), names_to = "month")
# treatment
df_long <- df_long %>%
mutate(
# To date, just in case
date_of_death = as.Date(date_of_death),
# Need to reformat the colnames from (e.g.) 2021_01 to 2021-01-01
month_fmt = as.Date(paste0(gsub("_", "-", month), "-01")),
# End of month
month_fmt = ceiling_date(month_fmt, "month") - days(1),
# End of month for month of death
date_of_death_eom = ceiling_date(date_of_death, "month") - days(1),
# Difference in months (using end of months)
month_diff = round(time_length(
interval(month_fmt, date_of_death_eom),"month"),0)) %>%
# Select only months bw 0 and 12
filter(month_diff %in% 0:12) %>%
# Create labels for the next step
mutate(labs = paste0("last_", month_diff,"_month"))
# To wider
end <- df_long %>%
pivot_wider(
id_cols = c(ID, date_of_death),
names_from = labs,
values_from = value
)
end
#> # A tibble: 2 x 15
#> ID date_of_death last_12_month last_11_month last_10_month last_9_month
#> <chr> <date> <dbl> <dbl> <dbl> <dbl>
#> 1 A 2020-01-02 6 23 23 5
#> 2 B 2019-11-15 2 5 6 7
#> # ... with 9 more variables: last_8_month <dbl>, last_7_month <dbl>,
#> # last_6_month <dbl>, last_5_month <dbl>, last_4_month <dbl>,
#> # last_3_month <dbl>, last_2_month <dbl>, last_1_month <dbl>,
#> # last_0_month <dbl>
Created on 2022-03-09 by the reprex package (v2.0.1)
Here is a tidyverse solution.
Reshape the data to long format, coerce the date columns to class "Date", use Dirk Eddelbuettel's accepted answer to this question to compute the date differences in months and keep the rows with month differences between 0 and 12.
This grouped long format is probably more useful: I use it to compute means by group and to plot the spending over the last 12 months prior to death. Since the question asks for a wide format, the output data set spending12_wide is also created.
options(width=205)
df1 <- read.table(text = "
ID 2018_11 2018_12 2019_01 2019_02 2019_03 2019_04 2019_05 2019_06 2019_07 2019_08 2019_09 2019_10 2019_11 2019_12 2020_01 date_of_death
A 15 14 6 23 23 5 6 30 1 15 6 7 8 30 1 2020-01-02
B 2 5 6 7 7 8 9 15 12 14 31 30 31 0 0 2019-11-15
", header = TRUE, check.names = FALSE)
suppressPackageStartupMessages(library(dplyr))
library(tidyr)
library(ggplot2)
# Dirk's functions
monnb <- function(d) {
lt <- as.POSIXlt(as.Date(d, origin = "1900-01-01"))
lt$year*12 + lt$mon
}
# compute a month difference as a difference between two monnb's
diffmon <- function(d1, d2) { monnb(d2) - monnb(d1) }
spending12 <- df1 %>%
pivot_longer(cols = starts_with('20'), names_to = "month") %>%
mutate(month = as.Date(paste0(month, "_01"), "%Y_%m_%d"),
date_of_death = as.Date(date_of_death)) %>%
group_by(ID, date_of_death) %>%
mutate(diffm = diffmon(month, date_of_death)) %>%
filter(diffm >= 0 & diffm <= 12)
spending12 %>% summarise(spending = mean(value), .groups = "drop")
#> # A tibble: 2 x 3
#> ID date_of_death spending
#> <chr> <date> <dbl>
#> 1 A 2020-01-02 12.4
#> 2 B 2019-11-15 13.6
spending12_wide <- spending12 %>%
mutate(month = zoo::as.yearmon(month)) %>%
pivot_wider(
id_cols = c(ID, date_of_death),
names_from = diffm,
names_glue = "last_{.name}_month",
values_from = value
)
spending12_wide
#> # A tibble: 2 x 15
#> # Groups: ID, date_of_death [2]
#> ID date_of_death last_12_month last_11_month last_10_month last_9_month last_8_month last_7_month last_6_month last_5_month last_4_month last_3_month last_2_month last_1_month last_0_month
#> <chr> <date> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 A 2020-01-02 6 23 23 5 6 30 1 15 6 7 8 30 1
#> 2 B 2019-11-15 2 5 6 7 7 8 9 15 12 14 31 30 31
ggplot(spending12, aes(month, value, color = ID)) +
geom_line() +
geom_point()
Created on 2022-03-09 by the reprex package (v2.0.1)
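To make the month arithmetic behind these answers concrete, here is a minimal base R sketch of the month-number idea used above. The function names month_index and months_before are illustrative only; they are not taken from any of the answers.

```r
# Month index: whole months elapsed since January 1900
month_index <- function(d) {
  lt <- as.POSIXlt(as.Date(d))
  lt$year * 12 + lt$mon
}

# How many calendar months before the month of death a spending month falls
months_before <- function(month_start, death) {
  month_index(death) - month_index(month_start)
}

a_12 <- months_before("2019-01-01", "2020-01-02")   # individual A, column 2019_01
b_12 <- months_before("2018-11-01", "2019-11-15")   # individual B, column 2018_11
b_0 <- months_before("2019-11-01", "2019-11-15")    # individual B, month of death
b_after <- months_before("2019-12-01", "2019-11-15") # a month after B's death

c(a_12, b_12, b_0, b_after)
#> [1] 12 12  0 -1
```

Months after the death come out negative, which is why filtering on a range of 0 to 12 drops them.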

Calculate difference between rows in R based on a specific row for each group

Hi everyone,
I have a data frame where each ID has multiple visits, numbered 1 to 5. I am trying to calculate the difference in score between each visit and visit 1, e.g. Score(Visit 5) - Score(Visit 1), and so on. How do I achieve that in R? Below is a sample dataset and the result dataset.
structure(list(ID = c("A", "A", "A", "A", "A", "B", "B", "B"),
Visit = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L), Score = c(16,
15, 13, 12, 12, 20, 19, 18)), class = "data.frame", row.names = c(NA,
-8L))
#> ID Visit Score
#> 1 A 1 16
#> 2 A 2 15
#> 3 A 3 13
#> 4 A 4 12
#> 5 A 5 12
#> 6 B 1 20
#> 7 B 2 19
#> 8 B 3 18
Created on 2021-05-20 by the reprex package (v2.0.0)
Here is the expected output
Here's a solution using dplyr
library(dplyr)
df %>%
group_by(ID) %>%
mutate(Difference = ifelse(Visit == 1, NA, Score[Visit == 1] - Score))
# A tibble: 8 x 4
# Groups: ID [2]
ID Visit Score Difference
<chr> <int> <dbl> <dbl>
1 A 1 16 NA
2 A 2 15 1
3 A 3 13 3
4 A 4 12 4
5 A 5 12 4
6 B 1 20 NA
7 B 2 19 1
8 B 3 18 2
Sample data
df <- data.frame(
ID = c('A', 'A', 'A', 'A', 'A', 'B', 'B', 'B'),
Visit = c(1:5, 1:3),
Score = c(16,15,13,12,12,20,19,18)
)
Side note: next time, I suggest posting sample data produced by calling dput() on your data frame rather than images.
Solution with dplyr using first
data <- data.frame(
ID = c(rep("A", 5), rep("B", 3)),
Visit = c(1:5, 1:3),
Score = c(16, 15, 13, 12, 12, 20, 19, 18))
library(dplyr)
data %>%
group_by(ID) %>%
arrange(Visit) %>%
mutate(Difference = first(Score) - Score)
#> # A tibble: 8 x 4
#> # Groups: ID [2]
#> ID Visit Score Difference
#> <chr> <int> <dbl> <dbl>
#> 1 A 1 16 0
#> 2 A 2 15 1
#> 3 A 3 13 3
#> 4 A 4 12 4
#> 5 A 5 12 4
#> 6 B 1 20 0
#> 7 B 2 19 1
#> 8 B 3 18 2
Created on 2021-05-20 by the reprex package (v2.0.0)
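Both answers above compute visit 1 minus the current visit. If the difference is wanted in the direction the question states, Score(Visit n) - Score(Visit 1), the operands can simply be swapped. A minimal sketch, assuming every ID has exactly one row with Visit == 1 (the data frame name visits is only for illustration):

```r
library(dplyr)

visits <- data.frame(
  ID = c(rep("A", 5), rep("B", 3)),
  Visit = c(1:5, 1:3),
  Score = c(16, 15, 13, 12, 12, 20, 19, 18)
)

result <- visits %>%
  group_by(ID) %>%
  # Current score minus the score recorded at visit 1 of the same ID
  mutate(Difference = Score - Score[Visit == 1]) %>%
  ungroup()

result
```

Indexing by Visit == 1 does not depend on row order, whereas first(Score) relies on the rows being sorted by Visit within each ID.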

"Pivot longer" all columns in single-row data frame into two "values" columns

Title is complicated, but I don't know how to put this problem into words. So I'll demonstrate.
Here's my problem, with the desired output:
library(tibble)
# Input:
tribble(
~n_1, ~n_2, ~n_3, ~pct_1, ~pct_2, ~pct_3,
10, 20, 30, 0.1, 0.2, 0.3
)
#> # A tibble: 1 x 6
#> n_1 n_2 n_3 pct_1 pct_2 pct_3
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 10 20 30 0.1 0.2 0.3
# Desired output:
tribble(
~name, ~n, ~pct,
1, 10, 0.1,
2, 20, 0.2,
3, 30, 0.3
)
#> # A tibble: 3 x 3
#> name n pct
#> <dbl> <dbl> <dbl>
#> 1 1 10 0.1
#> 2 2 20 0.2
#> 3 3 30 0.3
I tried tidyr::pivot_longer(), but I can't get it right. Is there any way?
One option could be:
df %>%
pivot_longer(everything(),
names_to = c(".value", "name"),
names_pattern = "(.*)_(.)")
name n pct
<chr> <dbl> <dbl>
1 1 10 0.1
2 2 20 0.2
3 3 30 0.3
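The special ".value" entry in names_to tells pivot_longer() that the matching part of each column name (n or pct) should become an output column, while the other part goes into name. A variant sketch, assuming a tidyr version that supports names_transform (1.1.0 or later), which splits on the underscore and converts name to an integer:

```r
library(tidyr)

df <- tibble::tibble(
  n_1 = 10, n_2 = 20, n_3 = 30,
  pct_1 = 0.1, pct_2 = 0.2, pct_3 = 0.3
)

long <- pivot_longer(
  df,
  everything(),
  # First piece of the name becomes a column, second piece goes into `name`
  names_to = c(".value", "name"),
  names_sep = "_",
  # Convert the suffix from character to integer
  names_transform = list(name = as.integer)
)

long
```

Using names_sep instead of the pattern "(.*)_(.)" also copes with suffixes longer than one character, such as n_10.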
Try this approach. As your main variable is concatenated into the column names, you can use separate() (with sep = '_') after pivot_longer(), and then pivot_wider() to obtain the expected data frame. Here is the code:
library(tidyverse)
#Code
df %>% pivot_longer(cols = everything()) %>%
separate(name,into = c('var','name'),sep = '_') %>%
pivot_wider(names_from = var,values_from=value)
Output:
# A tibble: 3 x 3
name n pct
<chr> <dbl> <dbl>
1 1 10 0.1
2 2 20 0.2
3 3 30 0.3
Some data used (the one you provided):
#Data
df <- structure(list(n_1 = 10, n_2 = 20, n_3 = 30, pct_1 = 0.1, pct_2 = 0.2,
pct_3 = 0.3), row.names = c(NA, -1L), class = c("tbl_df",
"tbl", "data.frame"))

Calculate quantiles for each group in a data frame and assign NA?

I made up this example to explain my question:
df= structure(list(group = structure(c(1L, 1L, 2L, 2L, 10L, 10L
), .Label = c("Eve", "ba", "De", "De","Mi", "C", "O", "W",
"as", "ras", "Cro", "ics"), class = "factor"), ds = c(8, 8,
1, 4, 4, 6), em = c(1, 3, 8,2, 7, 3)), row.names = c(74567L,
74568L, 74570L, 74576L, 74577L, 74578L), class = "data.frame")
For each group, I need to set the values of em and ds to NA when they are:
> quantile 90 = NA
< quantile 10 = NA
Here's a way to do it for each group and each numeric variable using dplyr and ifelse.
Having only a couple of samples per group makes it difficult to interpret the whole concept of quantiles, so the result you get very much depends on how you define a quantile. The type parameter allows you to specify the definition you are using. R defaults to type = 7:
library(dplyr)
df %>%
group_by(group) %>%
mutate(ds = ifelse(ds > quantile(ds, .9) | ds < quantile(ds, .1), NA, ds),
em = ifelse(em > quantile(em, .9) | em < quantile(em, .1), NA, em))
#> # A tibble: 6 x 3
#> # Groups: group [3]
#> group ds em
#> <fct> <dbl> <lgl>
#> 1 Eve 8 NA
#> 2 Eve 8 NA
#> 3 ba NA NA
#> 4 ba NA NA
#> 5 ras NA NA
#> 6 ras NA NA
However, you can change this depending on your definition:
df %>%
group_by(group) %>%
mutate(ds = ifelse(ds > quantile(ds, .9, type = 1) |
ds < quantile(ds, .1, type = 1), NA, ds),
em = ifelse(em > quantile(em, .9, type = 1) |
em < quantile(em, .1, type = 1), NA, em))
#> # A tibble: 6 x 3
#> # Groups: group [3]
#> group ds em
#> <fct> <dbl> <dbl>
#> 1 Eve 8 1
#> 2 Eve 8 3
#> 3 ba 1 8
#> 4 ba 4 2
#> 5 ras 4 7
#> 6 ras 6 3
Created on 2020-05-17 by the reprex package (v0.3.0)
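The same idea can be written once and applied to several columns with across(). The sketch below uses a hypothetical helper, mask_outside, which is not part of dplyr, and a simplified copy of the example data with group stored as a character column. Because it uses replace() instead of ifelse(), a column stays numeric even when every value in it is masked.

```r
library(dplyr)

# Hypothetical helper: set values outside the [lo, hi] quantile range to NA
mask_outside <- function(x, lo = 0.1, hi = 0.9, type = 7) {
  q <- unname(quantile(x, probs = c(lo, hi), type = type, na.rm = TRUE))
  replace(x, x < q[1] | x > q[2], NA)
}

df <- data.frame(
  group = c("Eve", "Eve", "ba", "ba", "ras", "ras"),
  ds = c(8, 8, 1, 4, 4, 6),
  em = c(1, 3, 8, 2, 7, 3)
)

# Default definition (type = 7): interpolated quantiles
masked <- df %>%
  group_by(group) %>%
  mutate(across(c(ds, em), mask_outside)) %>%
  ungroup()

# Alternative definition (type = 1): quantiles are always observed values
masked_type1 <- df %>%
  group_by(group) %>%
  mutate(across(c(ds, em), function(x) mask_outside(x, type = 1))) %>%
  ungroup()
```

With only two observations per group, type = 7 interpolates between them, so both values usually fall outside the range, while type = 1 returns the observed minimum and maximum and masks nothing.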
