Rearrangement columns of a table in R - r

I have the following table that I want to modify
Debt2017 Debt2018 Debt2019 Cash2017 Cash2018 Cash2019 Year Other
2 4 3 5 6 7 2018 x
3 8 9 7 9 9 2017 y
So that the result is the following
Debt Cash FLAG After Other
2 5 0 x
3 7 1 x
8 9 1 y
9 9 1 y|
Basically, I want to change the data so that I have the different years in different rows, eliminating the values for the year indicated in the column "Year" and adding a FLAG that tells me whether the data indicated in the row is from a previous (0) or following (1) year (with respect to the year indicated in the column "Year").
Furthermore, I also want to keep the column "Other".
Does anybody know how to do it in R?

library(dplyr)
library(tidyr)
df %>%
pivot_longer(Debt2017:Cash2019,
names_to = c(".value", "Year2"),
names_pattern = "(\\D+)(\\d+)") %>%
filter(Year != Year2) %>%
mutate(flag = +(Year2 > Year))
# # A tibble: 4 × 6
# Year Other Year2 Debt Cash flag
# <int> <chr> <chr> <int> <int> <int>
# 1 2018 x 2017 2 5 0
# 2 2018 x 2019 3 7 1
# 3 2017 y 2018 8 9 1
# 4 2017 y 2019 9 9 1
Data
df <- structure(list(Debt2017 = 2:3, Debt2018 = c(4L, 8L), Debt2019 = c(3L, 9L),
Cash2017 = c(5L, 7L), Cash2018 = c(6L, 9L), Cash2019 = c(7L, 9L),
Year = 2018:2017, Other = c("x", "y")), class = "data.frame", row.names = c(NA, -2L))

Related

Extract data based on time to death

Hi I'm analysing the pattern of spending for individuals before they died. My dataset contains individuals' monthly spending and their dates of death. The dataset looks similar to this:
ID 2018_11 2018_12 2019_01 2019_02 2019_03 2019_04 2019_05 2019_06 2019_07 2019_08 2019_09 2019_10 2019_11 2019_12 2020_01 date_of_death
A 15 14 6 23 23 5 6 30 1 15 6 7 8 30 1 2020-01-02
B 2 5 6 7 7 8 9 15 12 14 31 30 31 0 0 2019-11-15
Each column denotes the month of the year. For example, "2018_11" means November 2018. The number in each cell denotes the spending in that specific month.
I would like to construct a data frame which contains the spending data of each individual in their last 0-12 months. It will look like this:
ID last_12_month last_11_month ...... last_1_month last_0_month date_of_death
A 6 23 30 1 2020-01-02
B 2 5 30 31 2019-11-15
Each individual died at different time. For example, individual A died on 2020-01-02, so the data of the "last_0_month" for this person should be extracted from the column "2020_01", and that of "last_12_month" extracted from "2019_01"; individual B died on 2019-11-15, so the data of "last_0_month" for this person should be extracted from the column "2019_11", and that of "last_12_month" should be extracted from the column "2018_11".
I will be really grateful for your help.
Using data.table and lubridate packages
library(data.table)
library(lubridate)
setDT(dt)
dt <- melt(dt, id.vars = c("ID", "date_of_death"))
dt[, since_death := interval(ym(variable), ymd(date_of_death)) %/% months(1)]
dt <- dcast(dt[since_death %between% c(0, 12)], ID + date_of_death ~ since_death, value.var = "value", fun.aggregate = sum)
setcolorder(dt, c("ID", "date_of_death", rev(names(dt)[3:15])))
setnames(dt, old = names(dt)[3:15], new = paste("last", names(dt)[3:15], "month", sep = "_"))
Results
dt
# ID date_of_death last_12_month last_11_month last_10_month last_9_month last_8_month last_7_month last_6_month last_5_month last_4_month last_3_month
# 1: A 2020-01-02 6 23 23 5 6 30 1 15 6 7
# 2: B 2019-11-15 2 5 6 7 7 8 9 15 12 14
# last_2_month last_1_month last_0_month
# 1: 8 30 1
# 2: 31 30 31
Data
dt <- structure(list(ID = c("A", "B"), `2018_11` = c(15L, 2L), `2018_12` = c(14L,
5L), `2019_01` = c(6L, 6L), `2019_02` = c(23L, 7L), `2019_03` = c(23L,
7L), `2019_04` = c(5L, 8L), `2019_05` = c(6L, 9L), `2019_06` = c(30L,
15L), `2019_07` = c(1L, 12L), `2019_08` = 15:14, `2019_09` = c(6L,
31L), `2019_10` = c(7L, 30L), `2019_11` = c(8L, 31L), `2019_12` = c(30L,
0L), `2020_01` = 1:0, date_of_death = structure(c(18263L, 18215L
), class = c("IDate", "Date"))), row.names = c(NA, -2L), class = c("data.frame"))
here you can find a similar approach to the one presented by #RuiBarradas but using lubridate for extracting the difference in months:
library(dplyr)
library(tidyr)
library(lubridate)
# Initial data
df <- structure(list(
ID = c("A", "B"),
`2018_11` = c(15, 2),
`2018_12` = c(14, 5),
`2019_01` = c(6, 6),
`2019_02` = c(23, 7),
`2019_03` = c(23, 7),
`2019_04` = c(5, 8),
`2019_05` = c(6, 9),
`2019_06` = c(30, 15),
`2019_07` = c(1, 12),
`2019_08` = c(15, 14),
`2019_09` = c(6, 31),
`2019_10` = c(7, 30),
`2019_11` = c(8, 31),
`2019_12` = c(30, 0),
`2020_01` = c(1, 0),
date_of_death = c("2020-01-02", "2019-11-15")
),
row.names = c(NA, -2L),
class = "data.frame"
)
# Convert to longer all cols that start with 20 (e.g. 2020, 2021)
df_long <- df %>%
pivot_longer(starts_with("20"), names_to = "month")
# treatment
df_long <- df_long %>%
mutate(
# To date, just in case
date_of_death = as.Date(date_of_death),
# Need to reformat the colnames from (e.g.) 2021_01 to 2021-01-01
month_fmt = as.Date(paste0(gsub("_", "-", df_long$month), "-01")),
# End of month
month_fmt = ceiling_date(month_fmt, "month") - days(1),
# End of month for month of death
date_of_death_eom = ceiling_date(date_of_death, "month") - days(1),
# Difference in months (using end of months
month_diff = round(time_length(
interval(month_fmt, date_of_death_eom),"month"),0)) %>%
# Select only months bw 0 and 12
filter(month_diff %in% 0:12) %>%
# Create labels for the next step
mutate(labs = paste0("last_", month_diff,"_month"))
# To wider
end <- df_long %>%
pivot_wider(
id_cols = c(ID, date_of_death),
names_from = labs,
values_from = value
)
end
#> # A tibble: 2 x 15
#> ID date_of_death last_12_month last_11_month last_10_month last_9_month
#> <chr> <date> <dbl> <dbl> <dbl> <dbl>
#> 1 A 2020-01-02 6 23 23 5
#> 2 B 2019-11-15 2 5 6 7
#> # ... with 9 more variables: last_8_month <dbl>, last_7_month <dbl>,
#> # last_6_month <dbl>, last_5_month <dbl>, last_4_month <dbl>,
#> # last_3_month <dbl>, last_2_month <dbl>, last_1_month <dbl>,
#> # last_0_month <dbl>
Created on 2022-03-09 by the reprex package (v2.0.1)
Here is a tidyverse solution.
Reshape the data to long format, coerce the date columns to class "Date", use Dirk Eddelbuettel's accepted answer to this question to compute the date differences in months and keep the rows with month differences between 0 and 12.
This grouped long format is probably more useful and I compute means by group and plot the spending of the last 12 months prior to death but since the question asks for a wide format, the output data set spending12_wide is created.
options(width=205)
df1 <- read.table(text = "
ID 2018_11 2018_12 2019_01 2019_02 2019_03 2019_04 2019_05 2019_06 2019_07 2019_08 2019_09 2019_10 2019_11 2019_12 2020_01 date_of_death
A 15 14 6 23 23 5 6 30 1 15 6 7 8 30 1 2020-01-02
B 2 5 6 7 7 8 9 15 12 14 31 30 31 0 0 2019-11-15
", header = TRUE, check.names = FALSE)
suppressPackageStartupMessages(library(dplyr))
library(tidyr)
library(ggplot2)
# Dirk's functions
monnb <- function(d) {
lt <- as.POSIXlt(as.Date(d, origin = "1900-01-01"))
lt$year*12 + lt$mon
}
# compute a month difference as a difference between two monnb's
diffmon <- function(d1, d2) { monnb(d2) - monnb(d1) }
spending12 <- df1 %>%
pivot_longer(cols = starts_with('20'), names_to = "month") %>%
mutate(month = as.Date(paste0(month, "_01"), "%Y_%m_%d"),
date_of_death = as.Date(date_of_death)) %>%
group_by(ID, date_of_death) %>%
mutate(diffm = diffmon(month, date_of_death)) %>%
filter(diffm >= 0 & diffm <= 12)
spending12 %>% summarise(spending = mean(value), .groups = "drop")
#> # A tibble: 2 x 3
#> ID date_of_death spending
#> <chr> <date> <dbl>
#> 1 A 2020-01-02 12.4
#> 2 B 2019-11-15 13.6
spending12_wide <- spending12 %>%
mutate(month = zoo::as.yearmon(month)) %>%
pivot_wider(
id_cols = c(ID, date_of_death),
names_from = diffm,
names_glue = "last_{.name}_month",
values_from = value
)
spending12_wide
#> # A tibble: 2 x 15
#> # Groups: ID, date_of_death [2]
#> ID date_of_death last_12_month last_11_month last_10_month last_9_month last_8_month last_7_month last_6_month last_5_month last_4_month last_3_month last_2_month last_1_month last_0_month
#> <chr> <date> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 A 2020-01-02 6 23 23 5 6 30 1 15 6 7 8 30 1
#> 2 B 2019-11-15 2 5 6 7 7 8 9 15 12 14 31 30 31
ggplot(spending12, aes(month, value, color = ID)) +
geom_line() +
geom_point()
Created on 2022-03-09 by the reprex package (v2.0.1)

Reshaping of data

I am stuck with reshaping data in R and I hope someone could help me out.
The data looks like this:
ID
measurement
biomarker_x
biomarker_y
1
1
10
100
1
2
11
110
1
3
12
120
2
1
20
200
2
2
19
190
2
3
21
210
And needs to be reshaped to looking like this:
ID
biomarker
measurement1
measurement2
measurement3
1
x
10
11
12
1
y
100
110
120
2
x
20
19
21
2
y
200
190
210
I tried to work with tidyr::gather and spread and with pivot_wider and pivot_longer but failed.
If someone would have a solution for applying this on multiple biomarkers I would be very thankful.
can be done in tidyr only
library(tidyr)
df <- read.table(header = T, text = 'ID measurement biomarker_x biomarker_y
1 1 10 100
1 2 11 110
1 3 12 120
2 1 20 200
2 2 19 190
2 3 21 210')
df %>% pivot_longer(starts_with('biomarker'), names_to = 'biomarker', names_prefix = 'biomarker_') %>%
pivot_wider(names_from = measurement, values_from = value, names_prefix = 'measurement_')
#> # A tibble: 4 x 5
#> ID biomarker measurement_1 measurement_2 measurement_3
#> <int> <chr> <int> <int> <int>
#> 1 1 x 10 11 12
#> 2 1 y 100 110 120
#> 3 2 x 20 19 21
#> 4 2 y 200 190 210
Created on 2021-07-06 by the reprex package (v2.0.0)
Using recast from reshape2
library(reshape2)
names(df1)[-(1:2)] <- sub("biomarker_", "", names(df1)[-(1:2)])
reshape2::recast(df1, id.var = c("ID", "measurement"),
ID + variable ~ paste0('measurement', measurement), value.var = 'value')
-output
ID variable measurement1 measurement2 measurement3
1 1 x 10 11 12
2 1 y 100 110 120
3 2 x 20 19 21
4 2 y 200 190 210
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L), measurement = c(1L,
2L, 3L, 1L, 2L, 3L), biomarker_x = c(10L, 11L, 12L, 20L, 19L,
21L), biomarker_y = c(100L, 110L, 120L, 200L, 190L, 210L)),
class = "data.frame", row.names = c(NA,
-6L))
Does this work:
library(dplyr)
library(tidyr)
library(stringr)
df %>% pivot_longer(-c(ID, measurement), names_to = 'biomarker') %>% mutate(biomarker = str_extract(biomarker, '[xy]$')) %>%
pivot_wider(c(ID, biomarker), names_from = measurement, names_prefix = 'measurement', values_from = value)
# A tibble: 4 x 5
ID biomarker measurement1 measurement2 measurement3
<int> <chr> <int> <int> <int>
1 1 x 10 11 12
2 1 y 100 110 120
3 2 x 20 19 21
4 2 y 200 190 210
Here is one approach.
library(tidyverse)
dat |>
pivot_longer(
cols = starts_with("bio"),
names_to = "biomarker"
) |>
mutate(biomarker = str_remove(biomarker, "biomarker_")) |>
pivot_wider(
names_from = measurement,
values_from = value,
names_prefix = "measurement"
)
# # A tibble: 4 x 5
# ID biomarker measurement1 measurement2 measurement3
# <int> <chr> <int> <int> <int>
# 1 1 x 10 11 12
# 2 1 y 100 110 120
# 3 2 x 20 19 21
# 4 2 y 200 190 210
A pure base R option using nested ´reshape`
reshape(
reshape(
df,
direction = "long",
idvar = c("ID", "measurement"),
varying = -(1:2),
sep = "_"
),
direction = "wide",
idvar = c("ID", "time"),
timevar = "measurement"
)
gives
ID time biomarker.1 biomarker.2 biomarker.3
1.1.x 1 x 10 11 12
2.1.x 2 x 20 19 21
1.1.y 1 y 100 110 120
2.1.y 2 y 200 190 210

How to use column indices to collect values from columns in R

x y z column_indices
6 7 1 1,2
5 4 2 3
1 3 2 1,3
I have the column indices of the values I would like to collect in a separate column like so, what I want to create is something like this:
x y z column_indices values
6 7 1 1,2 6,7
5 4 2 3 2
1 3 2 1,3 1,2
What is the simplest way to do this in R?
Thanks!
In base R, we can use apply, split the column_indices on ',', convert them to integer and get the corresponding value from the row.
df$values <- apply(df, 1, function(x) {
inds <- as.integer(strsplit(x[4], ',')[[1]])
toString(x[inds])
})
df
# x y z column_indices values
#1 6 7 1 1,2 6, 7
#2 5 4 2 3 2
#3 1 3 2 1,3 1, 2
data
df <- structure(list(x = c(6L, 5L, 1L), y = c(7L, 4L, 3L), z = c(1L,
2L, 2L), column_indices = structure(c(1L, 3L, 2L), .Label = c("1,2",
"1,3", "3"), class = "factor")), class = "data.frame", row.names = c(NA, -3L))
One solution involving dplyr and tidyr could be:
df %>%
pivot_longer(-column_indices) %>%
group_by(column_indices) %>%
mutate(values = toString(value[1:n() %in% unlist(strsplit(column_indices, ","))])) %>%
pivot_wider(names_from = "name", values_from = "value")
column_indices values x y z
<chr> <chr> <int> <int> <int>
1 1,2 6, 7 6 7 1
2 3 2 5 4 2
3 1,3 1, 2 1 3 2

Create columns from aggregated row data in R

I have a data frame that contains historical price returns. The data is organized with date columns and many Asset columns (denoted as A1,A2...). Each asset column contains price return data for each unique historical date. I would like to process this data to create a data frame with many asset columns and only one row of data - with the data row containing the aggregated/average of the rows for the new columns. The new columns needs headers that are the original asset name, concatenated with date information. A simplified example of the original date follows:
> df <- read.csv("data.csv", header=T)
> df
Year Month A1 A2 A3
1 2015 Jan 1 1 1
2 2015 Feb 2 2 2
3 2015 Mar 3 3 3
4 2016 Jan 1 1 1
5 2016 Feb 2 2 2
6 2016 Mar 3 3 3
I used simple repeating numbers for the returns here. I am using a function that requires the data to be organized as follows:
> df2 <- read.csv("data2.csv", header=T)
> df2
Returns A1.Jan A1.Feb A1.Mar A2.Jan A2.Feb A2.Mar A3.Jan A3.Feb A3.Mar
1 Average 1 2 3 1 2 3 1 2 3
For clarity, A1.Jan contains the average of all Year's Jan returns. Thanks in advance for the insight and/or solution.
Take a look at the base function reshape. This is basically the same task as is solved by the last example on its help page:
reshape(df, idvar="Year", direction="wide", timevar="Month")
Year A1.Jan A2.Jan A3.Jan A1.Feb A2.Feb A3.Feb A1.Mar A2.Mar A3.Mar
1 2015 1 1 1 2 2 2 3 3 3
4 2016 1 1 1 2 2 2 3 3 3
You wanted the Year variable to remain as a column identifier but wanted the Month variable to act as a sequence that gets spread "wide".
With data.table you can do
library(data.table)
setDT(df)
df[, lapply(.SD, mean), .SDcols = names(df)[grep("^A", names(df))], by = Month
][, Returns := "Average"
][, melt(.SD, id = c("Month", "Returns"))
][, dcast(.SD, Returns ~ variable + Month, value.var = 'value', sep = ".")]
# Returns A1.Feb A1.Jan A1.Mar A2.Feb A2.Jan A2.Mar A3.Feb A3.Jan A3.Mar
#1: Average 2 1 3 2 1 3 2 1 3
In the first line we aggregate the data by Month. The part names(df)[grep("^A", names(df)) ensures that we only aggregate variables that start with the letter "A".
The second line creates variable Returns that contains the value "Average".
melt gathers you data into long format and dcast finally spreads into desired output.
data
df <- structure(list(Year = c(2015L, 2015L, 2015L, 2016L, 2016L, 2016L
), Month = c("Jan", "Feb", "Mar", "Jan", "Feb", "Mar"), A1 = c(1L,
2L, 3L, 1L, 2L, 3L), A2 = c(1L, 2L, 3L, 1L, 2L, 3L), A3 = c(1L,
2L, 3L, 1L, 2L, 3L)), .Names = c("Year", "Month", "A1", "A2",
"A3"), class = "data.frame", row.names = c("1", "2", "3", "4",
"5", "6"))
Here's a tidyverse solution. I factored the months so they can be ordered, then used tidyr::gather() to convert into long format so I could dplyr::group_by() by month to dplyr::summarise() to find the average:
library(dplyr)
library(tidyr)
df <- read.table(text = "
Year Month A1 A2 A3
1 2015 Jan 1 1 1
2 2015 Feb 2 2 2
3 2015 Mar 3 3 3
4 2016 Jan 1 1 1
5 2016 Feb 2 2 2
6 2016 Mar 3 3 3", header = T) %>%
tbl_df()
df$Month <- df$Month %>%
factor(levels = format(ISOdate(2000, 1:12, 1), "%b"))
df_tidy <- df %>%
gather(asset, value, -Year, -Month) %>%
group_by(Month, asset) %>%
summarise(Average = mean(value)) %>%
arrange(asset, Month)
df_tidy
# # A tibble: 9 x 3
# # Groups: Month [3]
# Month asset Average
# <fct> <chr> <dbl>
# 1 Jan A1 1
# 2 Feb A1 2
# 3 Mar A1 3
# 4 Jan A2 1
# 5 Feb A2 2
# 6 Mar A2 3
# 7 Jan A3 1
# 8 Feb A3 2
# 9 Mar A3 3
# convert to wide format, as in OP - not sure of 'easy' way
# to order columns by asset.month other than using 'select()'
# (it currently sorts alphabetically).
df_tidy %>%
unite(Returns, c(asset, Month), sep = ".") %>%
spread(Returns, Average)
# # A tibble: 1 x 9
# A1.Feb A1.Jan A1.Mar A2.Feb A2.Jan A2.Mar A3.Feb A3.Jan A3.Mar
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 2 1 3 2 1 3 2 1 3

working with integer ranges in dplyr

I have a tibble that encodes when each of 300 counties experienced a (potentially) recurrent event. The "shape of the data" is:
county event_start event_end
A 3 6
A 12 20
A 71 80
B 1 3
B 19 30
...
Some helpful characteristics here:
There is no missing data.
No county has two events that overlap (event_start_2 is always greater than event_end_1 for two events)
Within county, the events are sorted.
I want to reshape the data to be more like this:
county day event
A 1 no
A 2 no
A 3 yes
A 4 yes
A 5 yes
A 6 yes
A 7 no
...
I can imagine how to do this with a bunch of for loops and such. But is there a dplyrish way to do it?
One option would be to get the sequence between corresponding elements of 'event_start', 'event_end' with map, unnest the list output to expand the data, use complete to fill up the 'day' and replace the NA elements with 'no' for the 'event' column
library(tidyverse)
df1 %>%
transmute(county, day = map2(event_start, event_end, seq), event = 'yes') %>%
unnest %>%
group_by(county) %>%
complete(day = seq_len(max(day))) %>%
mutate(event = replace(event, is.na(event), 'no'))
# A tibble: 110 x 3
# Groups: county [2]
# county day event
# <chr> <int> <chr>
# 1 A 1 no
# 2 A 2 no
# 3 A 3 yes
# 4 A 4 yes
# 5 A 5 yes
# 6 A 6 yes
# 7 A 7 no
# 8 A 8 no
# 9 A 9 no
#10 A 10 no
# ... with 100 more rows
data
df1 <- structure(list(county = c("A", "A", "A", "B", "B"), event_start = c(3L,
12L, 71L, 1L, 19L), event_end = c(6L, 20L, 80L, 3L, 30L)), .Names = c("county",
"event_start", "event_end"), class = "data.frame", row.names = c(NA,
-5L))

Resources