Use pivot_longer to separate columns - r

I have a dataframe that looks like
id = c("1", "2", "3")
IN1999 = c(1, 1, 0)
IN2000 = c(1, 0, 1)
TEST1999 = c(10, 12, NA)
TEST2000 = c(15, NA, 11)
df <- data.frame(id, IN1999, IN2000, TEST1999, TEST2000)
I am trying to use pivot_longer to change it into this form:
id year IN TEST
1 1 1999 1 10
2 1 2000 1 15
3 2 1999 1 12
4 2 2000 0 NA
5 3 1999 0 NA
6 3 2000 1 11
My current code looks like this
df %>%
pivot_longer(col = !id, names_to = c(".value", "year"),
names_sep = 4)
but obviously, by setting names_sep = 4, R cuts IN1999 and IN2000 at the wrong place. How can I set the argument so that R separates the column name from the last four digits?

The names_sep argument in pivot_longer also accepts a regular expression. A lookahead lets you split just before a run of four digits, as in this example:
library(tidyr)
df |>
  pivot_longer(cols = !id, names_to = c(".value", "year"),
               names_sep = "(?=\\d{4})")
Output:
# A tibble: 6 × 4
id year IN TEST
<chr> <chr> <dbl> <dbl>
1 1 1999 1 10
2 1 2000 1 15
3 2 1999 1 12
4 2 2000 0 NA
5 3 1999 0 NA
6 3 2000 1 11
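A variation on the same idea, offered as a sketch rather than the answerer's method: names_pattern can capture the prefix and the four-digit year as two groups, and names_transform can convert year to an integer along the way. The pattern below assumes every column other than id is letters followed by exactly four digits.

```r
library(tidyr)

df <- data.frame(
  id = c("1", "2", "3"),
  IN1999 = c(1, 1, 0),
  IN2000 = c(1, 0, 1),
  TEST1999 = c(10, 12, NA),
  TEST2000 = c(15, NA, 11)
)

# Group 1 (letters) becomes the output column name via ".value";
# group 2 (four digits) becomes the year column.
long <- pivot_longer(
  df,
  cols = !id,
  names_to = c(".value", "year"),
  names_pattern = "^([A-Za-z]+)(\\d{4})$",
  names_transform = list(year = as.integer)
)
long
```

Unlike names_sep, this makes the expected shape of the column names explicit, so a column that does not fit the pattern is easier to spot.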

Related

apply function or loop within mutate

Let's say I have a data frame. I would like to mutate new columns by subtracting each pair of existing columns. The pairs follow a rule: in the code below, the first component of each subtraction always has the prefix base_g00 and the second always has the prefix allow_m00. The first component's id runs from 27 to 43 and the second component's id from 20 to 36, i.e. the first id minus 7. Can I write the following code with an apply function or a loop within mutate to make it simpler? Thanks so much in advance for any suggestions!
pred_error<-y07_13%>%mutate(annual_util_1=base_g0027-allow_m0020,
annual_util_2=base_g0028-allow_m0021,
annual_util_3=base_g0029-allow_m0022,
annual_util_4=base_g0030-allow_m0023,
annual_util_5=base_g0031-allow_m0024,
annual_util_6=base_g0032-allow_m0025,
annual_util_7=base_g0033-allow_m0026,
annual_util_8=base_g0034-allow_m0027,
annual_util_9=base_g0035-allow_m0028,
annual_util_10=base_g0036-allow_m0029,
annual_util_11=base_g0037-allow_m0030,
annual_util_12=base_g0038-allow_m0031,
annual_util_13=base_g0039-allow_m0032,
annual_util_14=base_g0040-allow_m0033,
annual_util_15=base_g0041-allow_m0034,
annual_util_16=base_g0042-allow_m0035,
annual_util_17=base_g0043-allow_m0036)
I think a more idiomatic tidyverse approach would be to reshape your data so those column groups are encoded as a variable instead of as separate columns which have the same semantic meaning.
For instance,
library(dplyr); library(tidyr); library(stringr)
y07_13 <- tibble(allow_m0021 = 1:5,
allow_m0022 = 2:6,
allow_m0023 = 11:15,
base_g0028 = 5,
base_g0029 = 3:7,
base_g0030 = 100)
y07_13 %>%
mutate(row = row_number()) %>%
pivot_longer(-row) %>%
mutate(type = str_extract(name, "allow_m|base_g"),
num = str_remove(name, type) %>% as.numeric(),
group = num - if_else(type == "allow_m", 20, 27)) %>%
select(row, type, group, value) %>%
pivot_wider(names_from = type, values_from = value) %>%
mutate(annual_util = base_g - allow_m)
Result
# A tibble: 15 x 5
row group allow_m base_g annual_util
<int> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 5 4
2 1 2 2 3 1
3 1 3 11 100 89
4 2 1 2 5 3
5 2 2 3 4 1
6 2 3 12 100 88
7 3 1 3 5 2
8 3 2 4 5 1
9 3 3 13 100 87
10 4 1 4 5 1
11 4 2 5 6 1
12 4 3 14 100 86
13 5 1 5 5 0
14 5 2 6 7 1
15 5 3 15 100 85
Here is a vectorised base R approach:
base_cols <- paste0("base_g00", 27:43)
allow_cols <- paste0("allow_m00", 20:36)
new_cols <- paste0("annual_util_", 1:17)
y07_13[new_cols] <- y07_13[base_cols] - y07_13[allow_cols]
y07_13
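To see the vectorised subtraction at work, here is a minimal self-contained sketch on toy data. The values and the three-pair range are made up for illustration; only the column-naming rule (allow id = base id minus 7) comes from the question.

```r
# Toy data with three column pairs (hypothetical values)
toy <- data.frame(
  base_g0027 = c(10, 20), base_g0028 = c(5, 6), base_g0029 = c(1, 2),
  allow_m0020 = c(1, 2),  allow_m0021 = c(3, 4), allow_m0022 = c(0, 1)
)

base_ids   <- 27:29
base_cols  <- sprintf("base_g%04d", base_ids)        # base_g0027, ...
allow_cols <- sprintf("allow_m%04d", base_ids - 7L)  # allow_m0020, ...
new_cols   <- paste0("annual_util_", seq_along(base_ids))

# Subtracting two data frames of equal shape works element by element,
# so all pairs are handled in one step.
toy[new_cols] <- toy[base_cols] - toy[allow_cols]
toy
```

Building the names with sprintf keeps the zero padding in one place, which helps if the id range ever crosses a digit boundary.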

How best to transform quarterly data into monthly

Below is the sample data, in the form in which I receive it. Each row is a quarter, and the months of that quarter are columns. I am trying to do some month-over-month calculations and think I need to transform the data frame first. I suspect pivot_longer is the right tool, but I have not found a similar example online. The desired result is shown below the data.
year<-c(2018,2018,2018,2018,2019,2019,2019,2019,2020,2020,2020,2020)
qtr<-c(1,2,3,4,1,2,3,4,1,2,3,4)
avgemp <-c(3,5,7,9,11,13,15,17,19,21,23,25)
month1emp<-c(2,4,6,8,10,12,14,16,18,20,22,24)
month2emp<-c(3,5,7,9,11,13,15,17,19,21,23,25)
month3emp<-c(4,6,8,10,12,14,16,18,20,22,24,26)
sample<-data.frame(year,qtr,month1emp,month2emp,month3emp)
Desired Result
year qtr month employment
2018 1 1 2
2018 1 2 3
2018 1 3 4
2018 2 4 4
2018 2 5 5
2018 2 6 6
and so on. At 2019, the month value would restart and go from 1 to 12.
We could use pivot_longer on the 'month' columns and specify names_pattern to capture the digits ((\\d+)) as 'month' and the trailing emp as the .value column, which is then renamed to employment.
library(dplyr)
library(tidyr)
sample %>%
pivot_longer(cols = starts_with('month'),
names_to = c("month", ".value"), names_pattern = ".*(\\d+)(emp)")%>%
rename(employment = emp)
Output:
# A tibble: 36 x 4
year qtr month employment
<dbl> <dbl> <chr> <dbl>
1 2018 1 1 2
2 2018 1 2 3
3 2018 1 3 4
4 2018 2 1 4
5 2018 2 2 5
6 2018 2 3 6
7 2018 3 1 6
8 2018 3 2 7
9 2018 3 3 8
10 2018 4 1 8
# … with 26 more rows
If we need the 'month' to count through the year, we can offset it by the 'qtr' value:
sample %>%
pivot_longer(cols = starts_with('month'),
names_to = c("month", ".value"), names_pattern = ".*(\\d+)(emp)")%>%
rename(employment = emp) %>%
mutate(month = as.integer(month) + c(0, 3, 6, 9)[qtr])
# A tibble: 36 x 4
year qtr month employment
<dbl> <dbl> <dbl> <dbl>
1 2018 1 1 2
2 2018 1 2 3
3 2018 1 3 4
4 2018 2 4 4
5 2018 2 5 5
6 2018 2 6 6
7 2018 3 7 6
8 2018 3 8 7
9 2018 3 9 8
10 2018 4 10 8
# … with 26 more rows
Base R solution:
# The question's data frame is called `sample`;
# alias it as df for the code below
df <- sample
# Create a vector of boolean values,
# denoting whether or not the columns should
# be unpivoted: unpivot_cols => boolean vector
unpivot_cols <- startsWith(
  names(df),
  "month"
)
# Reshape the data.frame, calculate
# the month value: rshpd_df => data.frame
rshpd_df <- transform(
reshape(
df,
direction = "long",
varying = names(df)[unpivot_cols],
ids = NULL,
timevar = "month",
times = seq_len(sum(unpivot_cols)),
v.names = "employment",
new.row.names = seq_len(
nrow(df) * ncol(df)
)
),
month = ((12 / 4) * (qtr - 1)) + month
)
# Order the data.frame by year and month:
# ordered_df => data.frame
ordered_df <- with(
rshpd_df,
rshpd_df[order(year, month),]
)
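The month arithmetic used in both answers can be isolated in a small helper. This is a minimal sketch that assumes three months per quarter; month_of_year is a hypothetical name, not part of any package.

```r
# Hypothetical helper: convert (quarter, month within quarter) to month of year,
# assuming every quarter has exactly three months.
month_of_year <- function(qtr, month_in_qtr) {
  3L * (as.integer(qtr) - 1L) + as.integer(month_in_qtr)
}

# Quarter 1 month 1, quarter 2 month 3, quarter 4 month 2
months <- month_of_year(qtr = c(1, 2, 4), month_in_qtr = c(1, 3, 2))
months
```

It is equivalent to the c(0, 3, 6, 9)[qtr] lookup and to ((12 / 4) * (qtr - 1)) + month above, and it restarts at 1 each year because it depends only on the quarter.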

A computationally efficient way to find the IDs of the Type 1 rows just above and below each Type 2 row?

I have the following data
df <- tibble(Type=c(1,2,2,1,1,2),ID=c(6,4,3,2,1,5))
Type ID
1 6
2 4
2 3
1 2
1 1
2 5
For each of the type 2 rows, I want to find the IDs of the type 1 rows just below and above them. For the above dataset, the output will be:
Type ID IDabove IDbelow
1 6 NA NA
2 4 6 2
2 3 6 2
1 2 NA NA
1 1 NA NA
2 5 1 NA
Naively, I can write a for loop to achieve this, but that would be too time consuming for the dataset I am dealing with.
One approach uses dplyr's lag and lead to get the previous and next values, respectively, and data.table's rleid to group consecutive rows with the same Type.
library(dplyr)
library(data.table)
df %>%
mutate(IDabove = ifelse(Type == 2, lag(ID), NA),
IDbelow = ifelse(Type == 2, lead(ID), NA),
grp = rleid(Type)) %>%
group_by(grp) %>%
mutate(IDabove = first(IDabove),
IDbelow = last(IDbelow)) %>%
ungroup() %>%
select(-grp)
# Type ID IDabove IDbelow
# <dbl> <dbl> <dbl> <dbl>
#1 1 6 NA NA
#2 2 4 6 2
#3 2 3 6 2
#4 1 2 NA NA
#5 1 1 NA NA
#6 2 5 1 NA
A dplyr-only solution:
You could create your own rleid function and then apply the logic provided by Ronak (many thanks, upvoted).
library(dplyr)
my_func <- function(x) {
x <- rle(x)$lengths
rep(seq_along(x), times=x)
}
# this part is the same as provided by Ronak.
df %>%
mutate(IDabove = ifelse(Type == 2, lag(ID), NA),
IDbelow = ifelse(Type == 2, lead(ID), NA),
grp = my_func(Type)) %>%
group_by(grp) %>%
mutate(IDabove = first(IDabove),
IDbelow = last(IDbelow)) %>%
ungroup() %>%
select(-grp)
Output:
Type ID IDabove IDbelow
<dbl> <dbl> <dbl> <dbl>
1 1 6 NA NA
2 2 4 6 2
3 2 3 6 2
4 1 2 NA NA
5 1 1 NA NA
6 2 5 1 NA
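Both answers depend on giving each run of consecutive equal Type values its own group number. A minimal base R sketch of that step on the question's Type column (run_id is a hypothetical name for the same idea as my_func):

```r
# Hypothetical helper name; same idea as my_func in the answer above.
run_id <- function(x) {
  run_lengths <- rle(x)$lengths                     # length of each run
  rep(seq_along(run_lengths), times = run_lengths)  # one id per run
}

type <- c(1, 2, 2, 1, 1, 2)
ids <- run_id(type)
ids
```

Rows 2 and 3 share a group, so taking the first IDabove and the last IDbelow within that group gives both Type 2 rows the same neighbours.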

Pivot_longer to manipulate table

I would like to pivot variables nclaims, npatients, nproviders to show up underneath groups.
I believe I should be using pivot_longer but it doesn't work.
library(tidyr)
ptype <- c(0,1,2,0,1)
groups <- c(rep(1,3), rep(2,2))
nclaims <- c(10,23,32,12,8)
nproviders <- c(2,4,5,1,1)
npatients <- c(8, 20, 29, 9, 6)
dta <- data.frame(ptype=ptype, groups=groups, nclaims=nclaims, nproviders=nproviders, npatients=npatients)
table <- pivot_longer(everything(dta), names_to = "groups", values_to=c("nclaims", "npatients", "nproviders"))
Desired output:
We need to use pivot_longer, then pivot_wider:
dta %>%
pivot_longer(nclaims:npatients) %>%
# values_fill = 0 changes NA values to 0, as in your desired result
pivot_wider(names_from = ptype, values_from = value,
values_fill = 0)
groups name `0` `1` `2`
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 nclaims 10 23 32
2 1 nproviders 2 4 5
3 1 npatients 8 20 29
4 2 nclaims 12 8 0
5 2 nproviders 1 1 0
6 2 npatients 9 6 0
Another approach, using reshape2::recast():
library( reshape2 )
recast( dta, groups + variable ~ ptype, id.var = c("ptype", "groups") )
# groups variable 0 1 2
# 1 1 nclaims 10 23 32
# 2 1 nproviders 2 4 5
# 3 1 npatients 8 20 29
# 4 2 nclaims 12 8 NA
# 5 2 nproviders 1 1 NA
# 6 2 npatients 9 6 NA

How to find the first observation of a column that matches a condition

I have a data frame:
df = tibble(a=c(7,6,10,12,12), b=c(3,5,8,8,7), c=c(4,4,12,15,20), week=c(1,2,3,4,5))
# A tibble: 5 x 4
a b c week
<dbl> <dbl> <dbl> <dbl>
1 7 3 4 1
2 6 5 4 2
3 10 8 12 3
4 12 8 15 4
5 12 7 20 5
and I want, for each of the columns a, b and c, the first week in which the observation is equal to or exceeds 10.
I.e. for column a it would be week 3, for column b it would be NA, and for column c it would be week 3 as well.
A desired outcome could look like this:
tibble(abc=c("a", "b", "c"), value=c(10, NA, 12), week=c(3, NA, 3))
# A tibble: 3 x 3
abc value week
<chr> <dbl> <dbl>
1 a 10 3
2 b NA NA
3 c 12 3
One way would be to get the data in long format and, for each column name, select the first value that is greater than or equal to 10. We fill the missing combinations with complete.
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = -week, names_to = 'abc') %>%
group_by(abc) %>%
slice(which(value >= 10)[1]) %>%
ungroup %>%
complete(abc = names(df)[-4])
# A tibble: 3 x 3
# abc week value
# <chr> <dbl> <dbl>
#1 a 3 10
#2 b NA NA
#3 c 3 12
Another way is to first calculate what we want and then transform the dataset into long format.
df %>%
summarise(across(a:c, list(week = ~week[which(. >= 10)[1]],
value = ~.[. >= 10][1]))) %>%
pivot_longer(cols = everything(),
names_to = c('abc', '.value'),
names_sep = "_")
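For comparison, the same lookup can be sketched in base R without reshaping. This is an illustrative alternative, not one of the posted answers; value_cols, first_hit and result are names chosen here.

```r
df <- data.frame(
  a = c(7, 6, 10, 12, 12),
  b = c(3, 5, 8, 8, 7),
  c = c(4, 4, 12, 15, 20),
  week = c(1, 2, 3, 4, 5)
)
value_cols <- c("a", "b", "c")

# Row index of the first value >= 10 in each column; NA when there is none
first_hit <- vapply(df[value_cols], function(x) which(x >= 10)[1], integer(1))

result <- data.frame(
  abc = value_cols,
  # Indexing with NA yields NA, so columns without a hit stay NA
  value = unname(vapply(value_cols,
                        function(col) df[[col]][first_hit[[col]]],
                        numeric(1))),
  week = df$week[first_hit]
)
result
```

Because indexing with NA returns NA, column b needs no special handling.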
