gather function in R to match patterns in character strings

I want to reshape a wide table into a long table with gather. The columns I want to gather follow a pattern in their names, but so far I have only managed to gather them by position. How can I gather them by the pattern in the column names instead? Please use only the gather function.
I have included an example dataset, however in the real dataset there are many more columns. Therefore I would like to gather all columns that:
start with an f or m
are followed by one OR two numbers
dput(head(test1, 1))
structure(list(startdate = "2019-11-06", id = "POL55", m0_9 = NA_real_,
m10_19 = NA_real_, m20_29 = NA_real_, m30_39 = NA_real_,
m40_49 = 32, m50_59 = NA_real_, m60_69 = NA_real_, m70 = NA_real_,
f0_9 = 32, f10_19 = NA_real_, f20_29 = NA_real_, f30_39 = NA_real_,
f40_49 = NA_real_, f50_59 = NA_real_, f60_69 = NA_real_,
f70 = NA_real_), row.names = c(NA, -1L), class = c("tbl_df",
"tbl", "data.frame"))
df_age2 <- test1 %>%
gather(age_cat, count, m0_9:f70 )
df_age2
expected output (there will be many more columns that are not gathered). The count should of course count...
startdate id age_cat count
<chr> <chr> <chr> <dbl>
1 2019-11-06 POL55 m0_9 NA
2 2019-11-06 POL56 m0_9 NA
3 2019-11-06 POL57 m0_9 NA
4 2019-11-06 POL58 m0_9 NA
5 2019-11-06 POL59 m0_9 NA
6 2019-11-06 POL60 m0_9 NA
7 2019-11-06 POL61 m0_9 NA
8 2019-11-06 POL62 m0_9 NA
9 2019-11-06 POL63 m0_9 NA
10 2019-11-06 POL64 m0_9 NA

Use starts_with:
test1 %>%
gather(age_bucket, count, c(starts_with("m"), starts_with("f")))
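Note that starts_with("m") would also pick up any other column in the wider dataset whose name begins with m or f. Since gather accepts tidyselect helpers, a regular expression passed to matches() may be closer to the stated rule (f or m followed by one or two digits). A minimal sketch, assuming the toy frame test1_small below (hypothetical, with an invented municipality column) stands in for the real data:

```r
library(tidyr)

# Hypothetical stand-in for test1; `municipality` is an invented column that
# starts with "m" but should NOT be gathered.
test1_small <- data.frame(
  startdate = "2019-11-06",
  id = "POL55",
  municipality = "X",
  m0_9 = NA_real_,
  m40_49 = 32,
  m70 = NA_real_,
  f0_9 = 32,
  f70 = NA_real_
)

# "^[fm]"            : name starts with f or m
# "[0-9]{1,2}"       : followed by one or two digits
# "(_[0-9]{1,2})?$"  : optionally "_" plus one or two more digits, then end
long <- gather(test1_small, age_cat, count,
               matches("^[fm][0-9]{1,2}(_[0-9]{1,2})?$"))
long
```

Only the five age columns are gathered here; municipality stays as an identifier column. If the real column names follow a looser scheme, the pattern would need adjusting.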

We can use pivot_longer from tidyr
library(dplyr)
library(tidyr)
test1 %>%
pivot_longer(cols = -c(startdate, id), names_to = c('.value', 'grp'), names_sep="_")
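Or, because the names do not split cleanly on "_" (m0_9 splits into m0 and 9, and m70 has no separator at all), use names_pattern: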
test1 %>%
pivot_longer(cols = -c(startdate, id),
names_to = c( '.value', 'grp'), names_pattern = "^([a-z])(.*)")
# A tibble: 8 x 5
# startdate id grp m f
# <chr> <chr> <chr> <dbl> <dbl>
#1 2019-11-06 POL55 0_9 NA 32
#2 2019-11-06 POL55 10_19 NA NA
#3 2019-11-06 POL55 20_29 NA NA
#4 2019-11-06 POL55 30_39 NA NA
#5 2019-11-06 POL55 40_49 32 NA
#6 2019-11-06 POL55 50_59 NA NA
#7 2019-11-06 POL55 60_69 NA NA
#8 2019-11-06 POL55 70 NA NA
Or maybe
test1 %>%
pivot_longer(cols = -c(startdate, id),
names_to = c( 'grp', '.value'), names_pattern = "^([a-z])(.*)")
# A tibble: 2 x 11
# startdate id grp `0_9` `10_19` `20_29` `30_39` `40_49` `50_59` `60_69` `70`
# <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 2019-11-06 POL55 m NA NA NA NA 32 NA NA NA
#2 2019-11-06 POL55 f 32 NA NA NA NA NA NA NA
Or it can be
test1 %>%
pivot_longer(cols = matches("^(f|m)\\d+_?\\d*$"), names_to = 'age_bucket',
values_to = 'count')
# A tibble: 16 x 4
# startdate id age_bucket count
# <chr> <chr> <chr> <dbl>
# 1 2019-11-06 POL55 m0_9 NA
# 2 2019-11-06 POL55 m10_19 NA
# 3 2019-11-06 POL55 m20_29 NA
# 4 2019-11-06 POL55 m30_39 NA
# 5 2019-11-06 POL55 m40_49 32
# 6 2019-11-06 POL55 m50_59 NA
# 7 2019-11-06 POL55 m60_69 NA
# 8 2019-11-06 POL55 m70 NA
# 9 2019-11-06 POL55 f0_9 32
#10 2019-11-06 POL55 f10_19 NA
#11 2019-11-06 POL55 f20_29 NA
#12 2019-11-06 POL55 f30_39 NA
#13 2019-11-06 POL55 f40_49 NA
#14 2019-11-06 POL55 f50_59 NA
#15 2019-11-06 POL55 f60_69 NA
#16 2019-11-06 POL55 f70 NA

Related

Convert semi-long data into wide data

I'm very sure there should be a simple alternative but I'm not able to figure it out. Currently using a for loop which is not optimal.
My dataframe is like this:
NAME <- c("ABC", "ABC", "ABC", "DEF", "GHI", "GHI", "JKL", "JKL", "JKL", "MNO")
YEAR <- c(2012, 2013, 2014, 2012, 2012, 2013, 2012, 2014, 2016, 2013)
MARKS <- c(45, 75, 95, 91, 75, 76, 85, 88, 89, 77)
MAXIMUM <- c(95, NA, NA, 91, 76, NA, 89, NA, NA, 77)
DF <- data.frame(
NAME,
YEAR,
MARKS,
MAXIMUM
)
> DF
NAME YEAR MARKS MAXIMUM
1 ABC 2012 45 95
2 ABC 2013 75 NA
3 ABC 2014 95 NA
4 DEF 2012 91 91
5 GHI 2012 75 76
6 GHI 2013 76 NA
7 JKL 2012 85 89
8 JKL 2014 88 NA
9 JKL 2016 89 NA
10 MNO 2013 77 77
I want to have only one name per row and each year-wise details (YEAR, MARKS and MAXIMUM columns) should be spread as individual headers. I have tried to use tidyr::pivot_wider function but was not successful.
I have given the sample output here:
Required output

Perhaps you could enumerate by NAME first based on row_number(). Then, use pivot_wider:
library(tidyverse)
DF %>%
group_by(NAME) %>%
mutate(n = row_number()) %>%
pivot_wider(id_cols = NAME, names_from = n, values_from = c(YEAR, MARKS, MAXIMUM))
Output
NAME YEAR_1 YEAR_2 YEAR_3 MARKS_1 MARKS_2 MARKS_3 MAXIMUM_1 MAXIMUM_2 MAXIMUM_3
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ABC 2012 2013 2014 45 75 95 95 NA NA
2 DEF 2012 NA NA 91 NA NA 91 NA NA
3 GHI 2012 2013 NA 75 76 NA 76 NA NA
4 JKL 2012 2014 2016 85 88 89 89 NA NA
5 MNO 2013 NA NA 77 NA NA 77 NA NA
Or, as mentioned by @RobertoT, you could make YEAR a factor and then line up your YEAR values. Using complete you can fill in NA for missing YEAR. The final select will order your columns.
DF$YEAR_FAC = factor(DF$YEAR)
DF %>%
group_by(NAME) %>%
complete(YEAR_FAC, fill = list(YEAR = NA)) %>%
mutate(n = row_number()) %>%
pivot_wider(id_cols = NAME, names_from = n, values_from = c(YEAR, MARKS, MAXIMUM)) %>%
select(NAME, ends_with(as.character(1:nlevels(DF$YEAR_FAC))))
Output
NAME YEAR_1 MARKS_1 MAXIMUM_1 YEAR_2 MARKS_2 MAXIMUM_2 YEAR_3 MARKS_3 MAXIMUM_3 YEAR_4 MARKS_4 MAXIMUM_4
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ABC 2012 45 95 2013 75 NA 2014 95 NA NA NA NA
2 DEF 2012 91 91 NA NA NA NA NA NA NA NA NA
3 GHI 2012 75 76 2013 76 NA NA NA NA NA NA NA
4 JKL 2012 85 89 NA NA NA 2014 88 NA 2016 89 NA
5 MNO NA NA NA 2013 77 77 NA NA NA NA NA NA

In addition to @Ben's solution (+1), we could use a trick I recently learned for ordering the columns (from "Combining two dataframes with alternating column position"):
DF %>%
group_by(NAME) %>%
mutate(n = row_number()) %>%
ungroup() %>%
pivot_wider(id_cols = NAME, names_from = n, values_from = c(YEAR, MARKS, MAXIMUM)) %>%
# reorder only the value columns; NAME (column 1) is kept in front
select(NAME, all_of(c(matrix(names(.)[-1], ncol = 3, byrow = TRUE))))
NAME YEAR_1 MARKS_1 MAXIMUM_1 YEAR_2 MARKS_2 MAXIMUM_2 YEAR_3 MARKS_3 MAXIMUM_3
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ABC 2012 45 95 2013 75 NA 2014 95 NA
2 DEF 2012 91 91 NA NA NA NA NA NA
3 GHI 2012 75 76 2013 76 NA NA NA NA
4 JKL 2012 85 89 2014 88 NA 2016 89 NA
5 MNO 2013 77 77 NA NA NA NA NA NA
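The interleaved column order can also be requested from pivot_wider directly. The following is a sketch that assumes tidyr 1.2.0 or later, where pivot_wider has a names_vary argument; with names_vary = "slowest" the values_from columns are grouped per names_from value, so no reordering step is needed afterwards.

```r
library(dplyr)
library(tidyr)

DF <- data.frame(
  NAME    = c("ABC", "ABC", "ABC", "DEF", "GHI", "GHI", "JKL", "JKL", "JKL", "MNO"),
  YEAR    = c(2012, 2013, 2014, 2012, 2012, 2013, 2012, 2014, 2016, 2013),
  MARKS   = c(45, 75, 95, 91, 75, 76, 85, 88, 89, 77),
  MAXIMUM = c(95, NA, NA, 91, 76, NA, 89, NA, NA, 77)
)

wide <- DF %>%
  group_by(NAME) %>%
  mutate(n = row_number()) %>%   # running index within each NAME
  ungroup() %>%
  pivot_wider(
    id_cols = NAME,
    names_from = n,
    values_from = c(YEAR, MARKS, MAXIMUM),
    names_vary = "slowest"       # YEAR_1, MARKS_1, MAXIMUM_1, YEAR_2, ...
  )

names(wide)
```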

I think all the previous answers have overlooked that the expected output is based on YEAR as a factor. The expected output has 4 grouped-columns per row, not 3. Therefore, you avoid mixing different years in the same column.
You can assign a number to every row (grp) based on the level of YEAR as a factor. Also, if you pivot longer first, you can arrange the values as you want and then pivot everything wider, so the columns are sorted as you expect:
library(tidyverse)
DF %>%
mutate(grp = as.integer(factor(DF$YEAR,unique(DF$YEAR)))) %>%
pivot_longer(cols=c('YEAR','MARKS','MAXIMUM'), names_to = 'COLNAMES', values_to= 'COL_VALUES') %>%
arrange(NAME,grp) %>%
pivot_wider(names_from = c(COLNAMES,grp), values_from= COL_VALUES, names_sep = '')
Output:
# A tibble: 5 x 13
NAME YEAR1 MARKS1 MAXIMUM1 YEAR2 MARKS2 MAXIMUM2 YEAR3 MARKS3 MAXIMUM3 YEAR4 MARKS4 MAXIMUM4
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ABC 2012 45 95 2013 75 NA 2014 95 NA NA NA NA
2 DEF 2012 91 91 NA NA NA NA NA NA NA NA NA
3 GHI 2012 75 76 2013 76 NA NA NA NA NA NA NA
4 JKL 2012 85 89 NA NA NA 2014 88 NA 2016 89 NA
5 MNO NA NA NA 2013 77 77 NA NA NA NA NA NA
However, I suggest keeping the years in the column names so the tibble is less confusing:
DF$YEAR = factor(DF$YEAR)
DF %>%
pivot_longer(cols=c('MARKS','MAXIMUM'), names_to = 'COLNAMES', values_to= 'COL_VALUES') %>%
arrange(NAME,YEAR) %>%
pivot_wider(names_from = c(COLNAMES,YEAR), values_from= COL_VALUES)
# A tibble: 5 x 9
NAME MARKS_2012 MAXIMUM_2012 MARKS_2013 MAXIMUM_2013 MARKS_2014 MAXIMUM_2014 MARKS_2016 MAXIMUM_2016
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ABC 45 95 75 NA 95 NA NA NA
2 DEF 91 91 NA NA NA NA NA NA
3 GHI 75 76 76 NA NA NA NA NA
4 JKL 85 89 NA NA 88 NA 89 NA
5 MNO NA NA 77 77 NA NA NA NA

Here is a version with data.table:
library(data.table)
DT <- setDT(DF)
# number the rows within each NAME
DT[,I := .I - .I[1] + 1,by = NAME]
# melt to have only three columns
tmp <- melt(DT,measure.vars = c("YEAR","MARKS","MAXIMUM"))
# transforming to wide
dcast(tmp,
NAME ~ paste0(variable,I),
value.var = "value")
NAME MARKS1 MARKS2 MARKS3 MAXIMUM1 MAXIMUM2 MAXIMUM3 YEAR1 YEAR2 YEAR3
1: ABC 45 75 95 95 NA NA 2012 2013 2014
2: DEF 91 NA NA 91 NA NA 2012 NA NA
3: GHI 75 76 NA 76 NA NA 2012 2013 NA
4: JKL 85 88 89 89 NA NA 2012 2014 2016
5: MNO 77 NA NA 77 NA NA 2013 NA NA

Collapse data frame so NAs are removed

I want to collapse this data frame so NA's are removed. How to accomplish this? Thanks!!
id <- c(1,1,1,2,2,3,4,5,5)
q1 <- c(23,55,7,88,90,34,11,22,99)
df <- data.frame(id,q1)
df$row <- 1:nrow(df)
spread(df, id, q1)
row 1 2 3 4 5
1 23 NA NA NA NA
2 55 NA NA NA NA
3 7 NA NA NA NA
4 NA 88 NA NA NA
5 NA 90 NA NA NA
6 NA NA 34 NA NA
7 NA NA NA 11 NA
8 NA NA NA NA 22
9 NA NA NA NA 99
I want it to look like this:
1 2 3 4 5
23 88 34 11 22
55 90 NA NA 99
7 NA NA NA NA

The row index should be created as a sequence within each 'id' rather than over the whole data frame. In addition, pivot_wider is a more general function than spread.
library(dplyr)
library(tidyr)
df %>%
group_by(id) %>%
mutate(row = row_number()) %>%
ungroup() %>%
pivot_wider(names_from = id, values_from = q1) %>%
select(-row)
-output
# A tibble: 3 × 5
`1` `2` `3` `4` `5`
<dbl> <dbl> <dbl> <dbl> <dbl>
1 23 88 34 11 22
2 55 90 NA NA 99
3 7 NA NA NA NA
Or use dcast
library(data.table)
dcast(setDT(df), rowid(id) ~ id, value.var = 'q1')[, id := NULL][]
1 2 3 4 5
<num> <num> <num> <num> <num>
1: 23 88 34 11 22
2: 55 90 NA NA 99
3: 7 NA NA NA NA

Here's a base R solution. I sort each column so the non-NA values are at the top, find the number of non-NA values in the column with the most non-NA values (n), and return the top n rows from the data frame.
library(tidyr)
id <- c(1,1,1,2,2,3,4,5,5)
q1 <- c(23,55,7,88,90,34,11,22,99)
df <- data.frame(id,q1)
df$row <- 1:nrow(df)
df <- spread(df, id, q1)
collapse_df <- function(df) {
move_na_to_bottom <- function(x) x[order(is.na(x))]
sorted <- sapply(df, move_na_to_bottom)
count_non_na <- function(x) sum(!is.na(x))
n <- max(apply(df, 2, count_non_na))
sorted[1:n, ]
}
collapse_df(df[, -1])
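The answer above still relies on tidyr::spread to build the wide table. A sketch that stays entirely in base R, assuming the same id and q1 vectors from the question: split q1 by id, then pad every piece with NA to the length of the longest one.

```r
id <- c(1, 1, 1, 2, 2, 3, 4, 5, 5)
q1 <- c(23, 55, 7, 88, 90, 34, 11, 22, 99)

# One numeric vector per id, in order of appearance within each id
cols <- split(q1, id)

# Length of the longest group decides the number of rows
n <- max(lengths(cols))

# `length<-` extends a vector to length n, filling the tail with NA
padded <- lapply(cols, `length<-`, n)

# check.names = FALSE keeps the column names "1".."5" instead of "X1".."X5"
out <- data.frame(padded, check.names = FALSE)
out
```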

Add new columns with custom function using mutate

I want to do something simple: add a new column using dplyr's mutate. Basically I have a DF with lots of columns and I want to select some of them, just the ones containing hist_avg, tgt_ and monthly_X_ly. This should be simple, and adding a new column named "fct_" + metric shouldn't be an issue. However, as you can see below, it adds the column but with a weird name (fct_visits$hist_avg_visits and fct_revenue$hist_avg_revenue_lcy).
Also, not sure but I tried to do it using mutate + across since it would save me lots of lines of code and couldn't figure out on how to do that.
library(tidyverse)
(example <- tibble(brand = c("Brand A", "Brand A", "Brand A", "Brand A", "Brand A"),
country = c("Country A", "Country A", "Country A", "Country A", "Country A"),
date = c("2020-08-01", "2020-08-02", "2020-08-03", "2020-08-04", "2020-08-05"),
visits = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_),
visits_ly = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_),
tgt_visits = c(2491306, 2491306, 2491306, 2491306, 2491306),
hist_avg_visits = c(177185, 175758, 225311, 210871, 197405),
monthly_visits_ly = c(3765612, 3765612, 3765612, 3765612, 3765612),
revenue_lcy = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_),
revenue_ly = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_),
tgt_revenue_lcy = c(48872737, 48872737, 48872737, 48872737, 48872737),
hist_avg_revenue_lcy = c(231101, 222236, 276497, 259775, 251167),
monthly_revenue_lcy_ly = c(17838660, 17838660, 17838660, 17838660, 17838660))) %>%
print(width = Inf)
#> # A tibble: 5 x 13
#> brand country date visits visits_ly tgt_visits hist_avg_visits
#> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Brand A Country A 2020-08-01 NA NA 2491306 177185
#> 2 Brand A Country A 2020-08-02 NA NA 2491306 175758
#> 3 Brand A Country A 2020-08-03 NA NA 2491306 225311
#> 4 Brand A Country A 2020-08-04 NA NA 2491306 210871
#> 5 Brand A Country A 2020-08-05 NA NA 2491306 197405
#> monthly_visits_ly revenue_lcy revenue_ly tgt_revenue_lcy hist_avg_revenue_lcy
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 3765612 NA NA 48872737 231101
#> 2 3765612 NA NA 48872737 222236
#> 3 3765612 NA NA 48872737 276497
#> 4 3765612 NA NA 48872737 259775
#> 5 3765612 NA NA 48872737 251167
#> monthly_revenue_lcy_ly
#> <dbl>
#> 1 17838660
#> 2 17838660
#> 3 17838660
#> 4 17838660
#> 5 17838660
first_forecast <- function(dataset, metric) {
avg_metric <- select(dataset, paste0("hist_avg_", metric))
tgt_metric <- select(dataset, paste0("tgt_", metric))
monthly_metric <- select(dataset, paste0("monthly_", metric, "_ly"))
output <- avg_metric * (tgt_metric / monthly_metric)
return(output)
}
example %>%
mutate(fct_visits = first_forecast(., "visits"),
fct_revenue = first_forecast(., "revenue_lcy")) %>%
print(width = Inf)
#> # A tibble: 5 x 15
#> brand country date visits visits_ly tgt_visits hist_avg_visits
#> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Brand A Country A 2020-08-01 NA NA 2491306 177185
#> 2 Brand A Country A 2020-08-02 NA NA 2491306 175758
#> 3 Brand A Country A 2020-08-03 NA NA 2491306 225311
#> 4 Brand A Country A 2020-08-04 NA NA 2491306 210871
#> 5 Brand A Country A 2020-08-05 NA NA 2491306 197405
#> monthly_visits_ly revenue_lcy revenue_ly tgt_revenue_lcy hist_avg_revenue_lcy
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 3765612 NA NA 48872737 231101
#> 2 3765612 NA NA 48872737 222236
#> 3 3765612 NA NA 48872737 276497
#> 4 3765612 NA NA 48872737 259775
#> 5 3765612 NA NA 48872737 251167
#> monthly_revenue_lcy_ly fct_visits$hist_avg_visits
#> <dbl> <dbl>
#> 1 17838660 117225.
#> 2 17838660 116280.
#> 3 17838660 149064.
#> 4 17838660 139511.
#> 5 17838660 130602.
#> fct_revenue$hist_avg_revenue_lcy
#> <dbl>
#> 1 633149.
#> 2 608862.
#> 3 757521.
#> 4 711708.
#> 5 688124.
Created on 2020-07-28 by the reprex package (v0.3.0)

Following the great suggestion of @Onyambu, the final part of your code should be this:
example %>%
cbind(fct_visits = first_forecast(., "visits"),
fct_revenue = first_forecast(., "revenue_lcy")) %>%
print(width = Inf)
brand country date visits visits_ly tgt_visits hist_avg_visits monthly_visits_ly revenue_lcy
1 Brand A Country A 2020-08-01 NA NA 2491306 177185 3765612 NA
2 Brand A Country A 2020-08-02 NA NA 2491306 175758 3765612 NA
3 Brand A Country A 2020-08-03 NA NA 2491306 225311 3765612 NA
4 Brand A Country A 2020-08-04 NA NA 2491306 210871 3765612 NA
5 Brand A Country A 2020-08-05 NA NA 2491306 197405 3765612 NA
revenue_ly tgt_revenue_lcy hist_avg_revenue_lcy monthly_revenue_lcy_ly hist_avg_visits hist_avg_revenue_lcy
1 NA 48872737 231101 17838660 117224.5 633149.5
2 NA 48872737 222236 17838660 116280.4 608862.0
3 NA 48872737 276497 17838660 149064.4 757521.3
4 NA 48872737 259775 17838660 139511.0 711707.9
5 NA 48872737 251167 17838660 130601.9 688124.5
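The odd names in the question come from select() returning a one-column data frame, so mutate stores a data frame inside the new column. An alternative sketch, assuming it is acceptable to change the helper (first_forecast_vec is a hypothetical name, not from the original post): extract each column as a plain vector with [[ so that mutate receives a numeric vector and the new columns keep the names fct_visits and so on.

```r
library(dplyr)

# Hypothetical variant of first_forecast that returns a numeric vector
first_forecast_vec <- function(dataset, metric) {
  avg_metric     <- dataset[[paste0("hist_avg_", metric)]]
  tgt_metric     <- dataset[[paste0("tgt_", metric)]]
  monthly_metric <- dataset[[paste0("monthly_", metric, "_ly")]]
  avg_metric * (tgt_metric / monthly_metric)
}

# Reduced copy of the question's data (first two rows, visits columns only)
example_small <- tibble(
  tgt_visits        = c(2491306, 2491306),
  hist_avg_visits   = c(177185, 175758),
  monthly_visits_ly = c(3765612, 3765612)
)

result <- example_small %>%
  mutate(fct_visits = first_forecast_vec(., "visits"))

names(result)
```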

unnest_auto and unnest_longer to unnest multiple columns

I have a nested dataframe that I'm trying to unnest. Here's a fake example of the structure.
df <- structure(list(`_id` = c("a", "b", "c", "d"),
variable = list(structure(list(type = c("u", "a", "u", "a", "u", "a", "a"),
m_ = c("m1",
"m2",
"m3",
"m4",
"m5",
"m6", "m7"), #omitted from original example by mistake
t_ = c("2015-07-21 4:13 PM",
"2016-04-21 7:25 PM",
"2017-10-04 9:49 PM",
"2018-12-04 12:29 PM",
"2019-04-20 20:20 AM",
"2016-05-20 12:00 AM",
"2016-06-20 12:00 AM"),
a_ = c(NA,
"",
NA,
"",
NA,
"C",
"C")),
class = "data.frame",
row.names = c(NA, 7L)),
structure(list(type = c("u", "a"),
m_ = c("m1",
"m2"),
t_ = c("2018-05-24 12:08 AM",
"2019-04-24 3:05 PM"),
a_ = c(NA, "")),
class = "data.frame",
row.names = 1:2),
structure(list(type = "u",
m_ = "m1",
t_ = "2018-02-17 3:14 PM"),
class = "data.frame",
row.names = 1L),
structure(list(type = "u",
m_ = "m1",
t_ = "2016-05-27 5:14 PM",
b_ = "b1",
i_ = "i1",
e_ = structure(list(),
.Names = character(0),
class = "data.frame",
row.names = c(NA, -1L)),
l_ = "l1"),
class = "data.frame",
row.names = 1L)),
myDate = structure(c(1521503311.992,
1521514011.161,
1551699584.65,
1553632693.94),
class = c("POSIXct", "POSIXt"))),
row.names = c(1L, 2L, 3L, 4L),
class = "data.frame")
View(df)
variable is a list of dataframes that vary in length (max fields is 7 in this example, but can expand over time).
I tried using the development version of tidyr to take advantage of the new unnest_auto() function.
# devtools::install_github("tidyverse/tidyr")
df2 <- unnest_auto(df, variable)
View(df2)
If I use unnest_longer on the result and specify one column like type I get it to expand.
df3 <- unnest_longer(df2, type)
I don't see any arguments to unnest_longer() that handle multiple columns. Is there a better approach?

Here, since you're unnesting a two dimensional structure (i.e. you want to change both the rows and columns), you can just use unnest:
library(tidyr)
df <- as_tibble(df)
df
#> # A tibble: 4 × 3
#> `_id` variable myDate
#> <chr> <list> <dttm>
#> 1 a <df [7 × 4]> 2018-03-19 18:48:31
#> 2 b <df [2 × 4]> 2018-03-19 21:46:51
#> 3 c <df [1 × 3]> 2019-03-04 05:39:44
#> 4 d <df [1 × 7]> 2019-03-26 15:38:13
df |>
unnest(variable)
#> # A tibble: 11 × 10
#> `_id` type m_ t_ a_ b_ i_ e_ l_ myDate
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <df[> <chr> <dttm>
#> 1 a u m1 2015-07-… <NA> <NA> <NA> <NA> 2018-03-19 18:48:31
#> 2 a a m2 2016-04-… "" <NA> <NA> <NA> 2018-03-19 18:48:31
#> 3 a u m3 2017-10-… <NA> <NA> <NA> <NA> 2018-03-19 18:48:31
#> 4 a a m4 2018-12-… "" <NA> <NA> <NA> 2018-03-19 18:48:31
#> 5 a u m5 2019-04-… <NA> <NA> <NA> <NA> 2018-03-19 18:48:31
#> 6 a a m6 2016-05-… "C" <NA> <NA> <NA> 2018-03-19 18:48:31
#> 7 a a m7 2016-06-… "C" <NA> <NA> <NA> 2018-03-19 18:48:31
#> 8 b u m1 2018-05-… <NA> <NA> <NA> <NA> 2018-03-19 21:46:51
#> 9 b a m2 2019-04-… "" <NA> <NA> <NA> 2018-03-19 21:46:51
#> 10 c u m1 2018-02-… <NA> <NA> <NA> <NA> 2019-03-04 05:39:44
#> 11 d u m1 2016-05-… <NA> b1 i1 l1 2019-03-26 15:38:13
If you did want to do it in two steps, you could take advantage of the fact that unnest_longer() now takes a tidyselect specification:
df |>
unnest_wider(variable) |>
unnest_longer(type:a_)
#> # A tibble: 11 × 10
#> `_id` type m_ t_ a_ b_ i_ e_ l_ myDate
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <df[> <chr> <dttm>
#> 1 a u m1 2015-07-… <NA> <NA> <NA> <NA> 2018-03-19 18:48:31
#> 2 a a m2 2016-04-… "" <NA> <NA> <NA> 2018-03-19 18:48:31
#> 3 a u m3 2017-10-… <NA> <NA> <NA> <NA> 2018-03-19 18:48:31
#> 4 a a m4 2018-12-… "" <NA> <NA> <NA> 2018-03-19 18:48:31
#> 5 a u m5 2019-04-… <NA> <NA> <NA> <NA> 2018-03-19 18:48:31
#> 6 a a m6 2016-05-… "C" <NA> <NA> <NA> 2018-03-19 18:48:31
#> 7 a a m7 2016-06-… "C" <NA> <NA> <NA> 2018-03-19 18:48:31
#> 8 b u m1 2018-05-… <NA> <NA> <NA> <NA> 2018-03-19 21:46:51
#> 9 b a m2 2019-04-… "" <NA> <NA> <NA> 2018-03-19 21:46:51
#> 10 c u m1 2018-02-… <NA> <NA> <NA> <NA> 2019-03-04 05:39:44
#> 11 d u m1 2016-05-… <NA> b1 i1 l1 2019-03-26 15:38:13

This appears to work:
df %>% unnest_auto(variable) %>% unnest()
#Warning message:
#`cols` is now required.
#Please use `cols = c(type, m_, t_, a_, e_)`
df %>% unnest_auto(variable) %>% unnest(cols = c(type, m_, t_, a_, e_, l_))
# A tibble: 11 x 10
`_id` type m_ t_ a_ b_ i_ e_ l_ myDate
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <???> <chr> <dttm>
1 a u m1 2015-… NA NA NA NA NA 2018-03-20 02:48:31
2 a a m2 2016-… "" NA NA NA NA 2018-03-20 02:48:31
3 a u m3 2017-… NA NA NA NA NA 2018-03-20 02:48:31
4 a a m4 2018-… "" NA NA NA NA 2018-03-20 02:48:31
5 a u m5 2019-… NA NA NA NA NA 2018-03-20 02:48:31
6 a a m6 2016-… C NA NA NA NA 2018-03-20 02:48:31
7 a a m7 2016-… C NA NA NA NA 2018-03-20 02:48:31
8 b u m1 2018-… NA NA NA NA NA 2018-03-20 05:46:51
9 b a m2 2019-… "" NA NA NA NA 2018-03-20 05:46:51
10 c u m1 2018-… NA NA NA NA NA 2019-03-04 14:39:44
11 d u m1 2016-… NA b1 i1 NA l1 2019-03-26 23:38:13

Spread and Gather table return duplicated rows with NA values

I have a table with categories and sub categories encoded in this format of columns name:
Date| Admissions__0 |Attendance__0 |Tri_1__0|Tri_2__0|...
Tri_1__1|Tri_2__1|...|
and I would like to change it to this format of columns using spread and gather function of tidyverse:
Date| Country code| Admissions| Attendance| Tri_1|Tri_2|...
I tried a posted solution, but the outcome returns multiple rows with NA rather than a single row.
My code used:
temp <- data %>% gather(key="columns",value ="dt",-Date)
temp <- temp %>% mutate(category = gsub(".*__","",columns)) %>% mutate(columns = gsub("__\\d","",columns))
temp %>% mutate(row = row_number()) %>% spread(key="columns",value="dt")
And my result is:
Date country_code row admissions attendance Tri_1 Tri_2 Tri_3 Tri_4 Tri_5
<chr> <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 01-APR-2014 0 275 NA 209 NA NA NA NA NA
2 01-APR-2014 0 640 84 NA NA NA NA NA NA
3 01-APR-2014 0 1005 NA NA 5 NA NA NA NA
4 01-APR-2014 0 1370 NA NA NA 33 NA NA NA
5 01-APR-2014 0 1735 NA NA NA NA 62 NA NA
6 01-APR-2014 0 2100 NA NA NA NA NA 80 NA
7 01-APR-2014 0 2465 NA NA NA NA NA NA 29
8 01-APR-2014 1 2830 NA 138 NA NA NA NA NA
9 01-APR-2014 1 3195 66 NA NA NA NA NA NA
10 01-APR-2014 1 3560 NA NA N/A NA NA NA NA
My expected results:
Date country_code row admissions attendance Tri_1 Tri_2 Tri_3 Tri_4 Tri_5
<chr> <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 01-APR-2014 0 275 84 209 5 33 62 80 29
8 01-APR-2014 1 2830 66 138 66 ... ... ... ...

We can use summarise_at with coalesce to remove the NA elements after the spread:
library(tidyverse)
data %>%
gather(key = "columns", value = "dt", -Date, na.rm = TRUE) %>%
mutate(category = gsub(".*__","",columns)) %>%
mutate(columns = gsub("__\\d","",columns)) %>%
group_by(Date, dt, columns, category) %>%
mutate(rn = row_number()) %>%
spread(columns, dt) %>%
select(-V1) %>%
summarise_at(vars(Admissions:Tri_5),list(~ coalesce(!!! .))) # %>%
# filter if needed
#filter_at(vars(Admissions:Tri_5), all_vars(!is.na(.)))
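If gather and spread are not a hard requirement, the same reshape can probably be done in one step with pivot_longer. A sketch, assuming every column other than Date is named measure__code; data_small and its values are invented for illustration. The special name ".value" tells pivot_longer to keep the part before "__" as column names, while the part after it goes into country_code.

```r
library(tidyr)

# Invented sample in the shape described in the question
data_small <- data.frame(
  Date          = c("01-APR-2014", "02-APR-2014"),
  Admissions__0 = c(84, 90),
  Attendance__0 = c(209, 215),
  Tri_1__0      = c(5, 7),
  Admissions__1 = c(66, 70),
  Attendance__1 = c(138, 140),
  Tri_1__1      = c(3, 4)
)

long <- pivot_longer(
  data_small,
  cols      = -Date,
  names_to  = c(".value", "country_code"),  # ".value" keeps the measure as a column
  names_sep = "__"                          # split on the double underscore only
)
long
```

Each Date and country_code pair ends up on a single row, so no coalesce step should be needed.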
