`pivot_longer` operation with different naming schemes - r

I have a df of the form:
df <- tibble(
id = c(1,2,3),
x02val_a = c(0,1,0),
x03val_a = c(1,0,0),
x04val_a = c(0,1,1),
x02val_b = c(0,2,0),
x03val_b = c(1,3,0),
x04val_b = c(0,1,2),
age02 = c(1,2,3),
age03 = c(2,3,4),
age04 = c(3,4,5)
)
I want to bring it into tidy format like:
# A tibble: 9 x 5
id year val_a val_b age
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 02 0 0 1
2 1 03 1 1 2
...
The answer from here worked for simpler naming schemes. With the naming scheme in my real dataset, however, I struggle to define a regex that matches all patterns.
My attempts so far all missed one scheme or the other. I can grab the one with the variable name first and the year last (age02), or the one with the prefix and year first and the name last (x02val_a), but not both at the same time.
Is there a way to do this with a) a regex, or b) some combination or parameterization of the pivot_longer call(s)?
I know there is always the possibility of doing it with a left join at the end, as I described here.
I tried to define the regex with two groups nested inside each other (since the groups are not strictly serial [meaning: left, right]), which led me to:
df %>%
pivot_longer(-id,names_to = c('.value', 'year'),names_pattern = '([a-z]+(\\d+)[a-z]+_[a-z])')

Let's try. It seems this names pattern works — .value may appear more than once in names_to, and the pieces captured by the .value groups are pasted together to form the output column name ("x" + "val_a" becomes xval_a, while "age" plus an empty third group stays age):
> df %>%
pivot_longer(-id,
names_to = c('.value', 'year','.value'),
names_pattern = '([a-z]+)(\\d+)([a-z_]*)')
# A tibble: 9 x 5
id year xval_a xval_b age
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 02 0 0 1
2 1 03 1 1 2
3 1 04 0 0 3
4 2 02 1 2 2
5 2 03 0 3 3
6 2 04 1 1 4
7 3 02 0 0 3
8 3 03 0 0 4
9 3 04 1 2 5

It's a bit roundabout, but because of the inconsistent name style, you might first rename your columns to match an easier pattern. There are 3 possible pieces of information in your names, but (at least in your example) each column contains only 2 of them.
The relevant pieces are:
A run of consecutive "[a-z_]" characters, occurring either after the leading "x" or after the 2 digits. Whichever is present gets moved to the beginning of the name; whichever is absent simply matches nothing and takes up no space.
2 digits, which get moved to the end.
The ".value" option in pivot_longer's names_to then gives you the column names in a single step based on this cleaner pattern. It should be easy to adjust the pattern as needed, e.g. to fit a different number of digits.
library(dplyr)
library(tidyr)
df %>%
rename_all(stringr::str_replace, "x?([a-z_]*)(\\d{2})([a-z_]*)", "\\1\\3\\2") %>%
pivot_longer(-id, names_to = c(".value", "year"), names_pattern = "([a-z_]+)(\\d{2})")
#> # A tibble: 9 x 5
#> id year val_a val_b age
#> <dbl> <chr> <dbl> <dbl> <dbl>
#> 1 1 02 0 0 1
#> 2 1 03 1 1 2
#> 3 1 04 0 0 3
#> 4 2 02 1 2 2
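If you are on dplyr >= 1.0, the superseded rename_all() call can be spelled with rename_with() instead; a sketch of the same rename-then-pivot idea, using the same regex and the data from the question:

```r
library(dplyr)
library(tidyr)
library(stringr)

# df as defined in the question
df <- tibble(
  id = c(1, 2, 3),
  x02val_a = c(0, 1, 0), x03val_a = c(1, 0, 0), x04val_a = c(0, 1, 1),
  x02val_b = c(0, 2, 0), x03val_b = c(1, 3, 0), x04val_b = c(0, 1, 2),
  age02 = c(1, 2, 3), age03 = c(2, 3, 4), age04 = c(3, 4, 5)
)

# rename_with() supersedes rename_all(); otherwise identical to the
# rename-then-pivot answer above ("id" has no digits, so it is untouched)
out <- df %>%
  rename_with(~ str_replace(.x, "x?([a-z_]*)(\\d{2})([a-z_]*)", "\\1\\3\\2")) %>%
  pivot_longer(-id, names_to = c(".value", "year"),
               names_pattern = "([a-z_]+)(\\d{2})")
out
```

rename_with() takes the transformation as a function, so the stringr call moves into a lambda; everything else is unchanged.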

Related

apply function or loop within mutate

Let's say I have a data frame. I would like to mutate new columns by subtracting each pair of the existing columns. There are rules in the matching columns: the prefix is the same for the first component of each subtraction (base_g00) and for the second component (allow_m00). Also, the first component's id numbers run from 27 to 43, and the second component's run from 20 to 36, which can be read as (1st_id - 7). For the following code, can I write an apply function or a loop within mutate to make it simpler? Thanks so much for any suggestions in advance!
pred_error<-y07_13%>%mutate(annual_util_1=base_g0027-allow_m0020,
annual_util_2=base_g0028-allow_m0021,
annual_util_3=base_g0029-allow_m0022,
annual_util_4=base_g0030-allow_m0023,
annual_util_5=base_g0031-allow_m0024,
annual_util_6=base_g0032-allow_m0025,
annual_util_7=base_g0033-allow_m0026,
annual_util_8=base_g0034-allow_m0027,
annual_util_9=base_g0035-allow_m0028,
annual_util_10=base_g0036-allow_m0029,
annual_util_11=base_g0037-allow_m0030,
annual_util_12=base_g0038-allow_m0031,
annual_util_13=base_g0039-allow_m0032,
annual_util_14=base_g0040-allow_m0033,
annual_util_15=base_g0041-allow_m0034,
annual_util_16=base_g0042-allow_m0035,
annual_util_17=base_g0043-allow_m0036)
I think a more idiomatic tidyverse approach would be to reshape your data so those column groups are encoded as a variable instead of as separate columns which have the same semantic meaning.
For instance,
library(dplyr); library(tidyr); library(stringr)
y07_13 <- tibble(allow_m0021 = 1:5,
allow_m0022 = 2:6,
allow_m0023 = 11:15,
base_g0028 = 5,
base_g0029 = 3:7,
base_g0030 = 100)
y07_13 %>%
mutate(row = row_number()) %>%
pivot_longer(-row) %>%
mutate(type = str_extract(name, "allow_m|base_g"),
num = str_remove(name, type) %>% as.numeric(),
group = num - if_else(type == "allow_m", 20, 27)) %>%
select(row, type, group, value) %>%
pivot_wider(names_from = type, values_from = value) %>%
mutate(annual_util = base_g - allow_m)
Result
# A tibble: 15 x 5
row group allow_m base_g annual_util
<int> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 5 4
2 1 2 2 3 1
3 1 3 11 100 89
4 2 1 2 5 3
5 2 2 3 4 1
6 2 3 12 100 88
7 3 1 3 5 2
8 3 2 4 5 1
9 3 3 13 100 87
10 4 1 4 5 1
11 4 2 5 6 1
12 4 3 14 100 86
13 5 1 5 5 0
14 5 2 6 7 1
15 5 3 15 100 85
Here is a vectorised base R approach -
base_cols <- paste0("base_g00", 27:43)
allow_cols <- paste0("allow_m00", 20:36)
new_cols <- paste0("annual_util", 1:17)
y07_13[new_cols] <- y07_13[base_cols] - y07_13[allow_cols]
y07_13
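If you do want the literal loop the question asks about, a plain for-loop over the paired names works too. A sketch over the toy data from the answer above (with the full data the indices would run over 27:43 and 20:36 as in the question):

```r
library(dplyr)

# Toy data from the answer above
y07_13 <- tibble(allow_m0021 = 1:5, allow_m0022 = 2:6, allow_m0023 = 11:15,
                 base_g0028 = 5, base_g0029 = 3:7, base_g0030 = 100)

# Build the paired column names, then subtract pair by pair
base_cols  <- paste0("base_g00", 28:30)
allow_cols <- paste0("allow_m00", 21:23)
for (k in seq_along(base_cols)) {
  y07_13[[paste0("annual_util_", k)]] <-
    y07_13[[base_cols[k]]] - y07_13[[allow_cols[k]]]
}
y07_13
```

This is just the loop form of the vectorised subtraction above; the vectorised version is usually preferable, but the loop makes the pairing rule explicit.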

Cast multiple values in R [duplicate]

This question already has answers here:
Convert data from long format to wide format with multiple measure columns
(6 answers)
Closed 1 year ago.
Is there a way to cast multiple values in R?
asd <- data.frame(week = c(1,1,2,2), year = c("2019","2020","2019","2020"), val = c(1,2,3,4), cap = c(3,4,6,7))
Expected output
week 2019_val 2020_val 2019_cap 2020_cap
1 1 2 3 6
2 3 4 4 7
If you want to do this in base R, you can use reshape:
reshape(asd, direction = "wide", idvar = "week", timevar = "year", sep = "_")
#> week val_2019 cap_2019 val_2020 cap_2020
#> 1 1 1 3 2 4
#> 3 2 3 6 4 7
Note that it is best not to start your new column names with the year, since variable names beginning with numbers are not syntactically valid in R and therefore always need to be quoted. It becomes quite tiresome to write asd$`2020_val` rather than asd$val_2020, and it can easily lead to errors when one forgets the backticks.
With tidyr::pivot_wider you could do:
asd <- data.frame(week = c(1,1,2,2), year = c("2019","2020","2019","2020"), val = c(1,2,3,4), cap = c(3,4,6,7))
tidyr::pivot_wider(asd, names_from = year, values_from = c(val, cap), names_glue = "{year}_{.value}")
#> # A tibble: 2 × 5
#> week `2019_val` `2020_val` `2019_cap` `2020_cap`
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 2 3 4
#> 2 2 3 4 6 7
For completion, here is data.table option -
library(data.table)
dcast(setDT(asd), week~year, value.var = c('val', 'cap'))
# week val_2019 val_2020 cap_2019 cap_2020
#1: 1 1 2 3 4
#2: 2 3 4 6 7
Slightly different approach using pivot_longer and pivot_wider together:
library(tidyr)
library(dplyr)
asd %>%
pivot_longer(
cols = -c(week, year)
) %>%
pivot_wider(
names_from = c(year, name)
)
week `2019_val` `2019_cap` `2020_val` `2020_cap`
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 3 2 4
2 2 3 6 4 7

Strange behavior with a conditional mutate with dplyr

My apologies if this topic has been discussed somewhere; I was not able to find it.
I was trying to apply a quite simple conditional mutate() with dplyr when I noticed something that seemed strange to me. Let me explain:
Let's say that in a data.frame I want to modify a variable (here VALUE) according to the value of a specific row within each group (here COND).
The modification is: "if the last value of COND within the current group is 0, then set VALUE to 99 for the current group, otherwise do nothing".
Here's what I naturally wrote:
tab <- data.frame(
ID = c(rep(1,3), rep(2,3)),
COND = c(c(1,0,0), rep(1,3)),
VALUE = 1:6
)
tab %>%
group_by(ID) %>%
mutate(VALUE = ifelse(COND[n()] == 0,
99,
VALUE))
# ID COND VALUE
# <dbl> <dbl> <dbl>
# 1 1 1 99
# 2 1 0 99
# 3 1 0 99
# 4 2 1 4
# 5 2 1 4 <
# 6 2 1 4 <
The propagation went well for the first group: VALUE is now 99, which is legitimate (COND == 0 in row 3). But I was surprised to see that VALUE also changed for the second group, where the first value of VALUE was propagated within the group even though the condition is not fulfilled.
Can someone enlighten me on what I am misunderstanding here?
Expected result was:
# ID COND VALUE
# <dbl> <dbl> <dbl>
# 1 1 1 99
# 2 1 0 99
# 3 1 0 99
# 4 2 1 4
# 5 2 1 5 <
# 6 2 1 6 <
[edit] I also tried using case_when(), which apparently I do not handle well either:
tab %>%
group_by(ID) %>%
mutate(VALUE = case_when(
COND[n()] == 0 ~ 99,
TRUE ~ VALUE
))
# Error: must be a double vector, not an integer vector
One workaround would be to calculate an intermediate variable, but I am quite surprised at having to do that.
Possible solution:
tab %>%
group_by(ID) %>%
mutate(TEST_COND = COND[n()] == 0,
VALUE = ifelse(TEST_COND, 99, VALUE))
# ID COND VALUE TEST_COND
# <dbl> <dbl> <dbl> <lgl>
# 1 1 1 99 TRUE
# 2 1 0 99 TRUE
# 3 1 0 99 TRUE
# 4 2 1 4 FALSE
# 5 2 1 5 FALSE
# 6 2 1 6 FALSE
# Yeepee
The issue is that ifelse() returns a result with the length of its test: COND[n()] == 0 has length 1, so ifelse() returns a single value, which mutate() then recycles across the whole group. The case_when() error is a separate type problem: all right-hand sides must share one type, and 99 is a double while VALUE is an integer, so use 99L. Try this
library(dplyr)
tab <- data.frame(
ID = c(rep(1,3), rep(2,3)),
COND = c(1, rep(0,2), rep(1,3)),
VALUE = 1:6
)
tab %>%
group_by(ID) %>%
mutate(VALUE = case_when(last(COND) == 0 ~ 99L,
TRUE ~ VALUE))
#> # A tibble: 6 x 3
#> # Groups: ID [2]
#> ID COND VALUE
#> <dbl> <dbl> <int>
#> 1 1 1 99
#> 2 1 0 99
#> 3 1 0 99
#> 4 2 1 4
#> 5 2 1 5
#> 6 2 1 6
Created on 2020-05-12 by the reprex package (v0.3.0)
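The root cause is that ifelse() shapes its result after the test argument, not after the yes/no arguments: a length-1 test yields a length-1 result, which mutate() then recycles over the group. A minimal illustration:

```r
# ifelse() returns a result as long as its *test*:
ifelse(TRUE, 99, 1:6)            # length-1 test -> 99, not a 6-vector
ifelse(c(TRUE, FALSE), 99, 1:6)  # length-2 test -> c(99, 2)
```

That is why COND[n()] == 0 (a single logical) collapsed VALUE to one value per group. case_when() keeps the full length but is strict about types, hence the integer/double error with 99 vs 99L.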

Dplyr solution using slice and group

Ciao, here is my reproducible example.
a=c(1,2,3,4,5,6)
a1=c(15,17,17,16,14,15)
a2=c(0,0,1,1,1,0)
b=c(1,0,NA,NA,0,NA)
c=c(2010,2010,2010,2010,2010,2010)
d=c(1,1,0,1,0,NA)
e=c(2012,2012,2012,2012,2012,2012)
f=c(1,0,0,0,0,NA)
g=c(2014,2014,2014,2014,2014,2014)
h=c(1,1,0,1,0,NA)
i=c(2010,2012,2014,2012,2014,2014)
mydata = data.frame(a,a1,a2,b,c,d,e,f,g,h,i)
names(mydata) = c("id","age","gender","drop1","year1","drop2","year2","drop3","year3","drop4","year4")
mydata2 <- reshape(mydata, direction = "long", varying = list(c("year1","year2","year3","year4"), c("drop1","drop2","drop3","drop4")),v.names = c("year", "drop"), idvar = "X", timevar = "Year", times = c(1:4))
x1 = mydata2 %>%
group_by(id) %>%
slice(which(drop==1)[1])
x2 = mydata2 %>%
group_by(id) %>%
slice(which(drop==0)[1])
I have data "mydata2" which is tall, such that every ID has many rows.
I want to make a new data set "x" such that every ID has one row, based on whether they drop or not.
For the first of drop1 drop2 drop3 drop4 that equals 1, I want to take the year of that and put it in a variable dropYEAR. If none of drop1 drop2 drop3 drop4 equals 1, I want to put the last data point of year1 year2 year3 year4 in dropYEAR.
Ultimately every ID should have 1 row, and I want to create 2 new columns: didDROP equals 1 if the ID ever dropped and 0 otherwise; dropYEAR equals the year of the drop if didDROP equals 1, or the last reported year1 year2 year3 year4 if the ID never dropped. I tried to do this in dplyr, but it gives only part of what I want because it drops the IDs that never have drop equal to 1.
This is the desired output, thanks to @Wimpel:
First run mydata2 %>% arrange(id) to understand the dataset. Then, using dplyr's first and last, we can pull the first year where drop == 1, or the last non-NA year in case drop never equals 1. case_when is used for didDROP since it deals nicely with NAs.
library(dplyr)
mydata2 %>% group_by(id) %>%
mutate(dropY=first(year[!is.na(drop) & drop==1]),
dropYEAR=if_else(is.na(dropY), last(year[!is.na(drop)]),dropY)) %>%
slice(1)
#Update
mydata2 %>% group_by(id) %>%
mutate(dropY=first(year[!is.na(drop) & drop==1]),
dropYEAR=if_else(is.na(dropY), last(year),dropY),
didDROP=case_when(any(drop==1) ~ 1, #Return 1 if there is any drop=1 o.w it will return 0
TRUE ~ 0)) %>%
select(-dropY) %>% slice(1)
# A tibble: 6 x 9
# Groups: id [6]
id age gender Year year drop X dropYEAR didDROP
<dbl> <dbl> <dbl> <int> <dbl> <dbl> <int> <dbl> <dbl>
1 1 15 0 1 2010 1 1 2010 1
2 2 17 0 1 2010 0 2 2012 1
3 3 17 1 1 2010 NA 3 2014 0
4 4 16 1 1 2010 NA 4 2012 1
5 5 14 1 1 2010 0 5 2014 0
6 6 15 0 1 2010 NA 6 2014 0
I hope this is what you're looking for.
You can sort by id, drop and year, conditionally on dropping or not: -drop puts the rows with drop == 1 first, and year*(2*drop-1) sorts the years ascending when drop == 1 (earliest drop wins) but descending when drop == 0 (latest year wins), so slice(1) picks the right row either way:
library(dplyr)
mydata2 %>%
mutate(drop=ifelse(is.na(drop),0,drop)) %>%
arrange(id,-drop,year*(2*drop-1)) %>%
group_by(id) %>%
slice(1) %>%
select(id,age,gender,didDROP=drop,dropYEAR=year)
# A tibble: 6 x 5
# Groups: id [6]
id age gender didDROP dropYEAR
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 15 0 1 2010
2 2 17 0 1 2012
3 3 17 1 0 2014
4 4 16 1 1 2012
5 5 14 1 0 2014
6 6 15 0 0 2014

Dynamically Normalize all rows with first element within a group

Suppose I have the following data frame:
year subject grade study_time
1 1 a 30 20
2 2 a 60 60
3 1 b 30 10
4 2 b 90 100
What I would like to do is be able to divide grade and study_time by their first record within each subject. I do the following:
df %>%
group_by(subject) %>%
mutate(RN = row_number()) %>%
mutate(study_time = study_time/study_time[RN ==1],
grade = grade/grade[RN==1]) %>%
select(-RN)
I would get the following output
year subject grade study_time
1 1 a 1 1
2 2 a 2 3
3 1 b 1 1
4 2 b 3 10
It's fairly easy to do when I know the variable names. However, I'm trying to write a generalized function able to act on any data.frame/data.table/tibble where I may not know the names of the variables that I need to mutate; I'll only know the variable names not to mutate. I'm trying to get this done using tidyverse/data.table and I can't get anything to work.
Any help would be greatly appreciated.
We group by 'subject' and use mutate_at to change multiple columns, dividing each element by the first element of its column within the group
library(dplyr)
df %>%
group_by(subject) %>%
mutate_at(3:4, funs(./first(.)))
# A tibble: 4 x 4
# Groups: subject [2]
# year subject grade study_time
# <int> <chr> <dbl> <dbl>
#1 1 a 1 1
#2 2 a 2 3
#3 1 b 1 1
#4 2 b 3 10
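funs() is deprecated and mutate_at() is superseded in current dplyr; assuming dplyr >= 1.0, the same idea is spelled with across(). This also fits the requirement of only knowing the columns not to transform, since the grouping column is skipped automatically and the rest can be excluded by name:

```r
library(dplyr)

df <- tibble(year = c(1, 2, 1, 2),
             subject = c("a", "a", "b", "b"),
             grade = c(30, 60, 30, 90),
             study_time = c(20, 60, 10, 100))

out <- df %>%
  group_by(subject) %>%
  # everything except the known exclude-list (year; the grouping
  # column subject is excluded automatically) gets divided by its
  # first value within the group
  mutate(across(-year, ~ .x / first(.x))) %>%
  ungroup()
out
```

To generalize, the exclude-list can be passed as a character vector and used as across(-all_of(exclude), ...).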
