tidyverse: data transformation, gather() - r

I am trying to transform a dataset:
[1]: https://i.stack.imgur.com/09Ioo.png
To something like this:
[2]: https://i.stack.imgur.com/vKKu2.png
How can I do this on R? I tried using gather() but somehow im not getting the results..
library(tidyverse)
df_gather <- df %>% gather(key = "Day", "Sensor",2:5)
View(df_gather)
Thanks in advance for your help!

Here is another tidyverse approach:
dat %>%
rename_with(., ~str_replace_all(., "Sensor", "Time_")) %>%
pivot_longer(-Date,
names_sep = "_",
names_to = c(".value", "Sensor")
)
Date Sensor Time
<int> <chr> <dbl>
1 1 1 155.
2 1 2 160.
3 1 3 126.
4 1 4 162.
5 1 5 155.
6 2 1 126.
7 2 2 133.
8 2 3 155.
9 2 4 171.
10 2 5 154.
# … with 15 more rows

Because you did not provide the data in an easily reused form, here is a dummy data frame similar to yours:
dat <-structure(list(Date = 1:5, Sensor1 = c(154.501112480648, 125.564142037183,
184.578892146237, 155.085407197475, 176.232917583548), Sensor2 = c(159.958130051382,
132.943481742404, 100.740377581678, 178.590174368583, 182.851045904681
), Sensor3 = c(125.962588260882, 155.333150480874, 122.294128965586,
122.685094899498, 150.199430575594), Sensor4 = c(162.315403693356,
170.65782523714, 117.775949183851, 145.122508681379, 193.589874636382
), Sensor5 = c(154.887120774947, 154.432400292717, 139.244429254904,
180.038237478584, 160.314362798817)), class = "data.frame", row.names = c(NA,
-5L))
To transform the data to the form you showed, you can use pivot_longer (which superseded gather) and then changed the names as necessary.
dat |>
pivot_longer(cols = starts_with("Sensor")) |>
mutate(name = str_replace(name, "Sensor", "")) |>
rename(Day = Date, Sensor = name, Time = value)
# The result
# A tibble: 25 × 3
Day Sensor Time
<int> <chr> <dbl>
1 1 1 155.
2 1 2 160.
3 1 3 126.
4 1 4 162.
5 1 5 155.
6 2 1 126.
7 2 2 133.
8 2 3 155.
9 2 4 171.
10 2 5 154.
# … with 15 more rows

Related

Recode multiple columns to numbers increasingly in R

I have 50 columns of names, but here I have presented only 4 columns for convenience.
Name1 Name2 Name3 Name4
Rose,Ali Van,Hall Ghol,Dam Murr,kate
Camp,Laura Ka,Klo Dan,Dan Ali,Hoss
Rose,Ali Van,Hall Ghol,Dam Kol,Kan
Murr,Kate Ismal, Ismal Sian,Rozi Nas,Ami
Ghol,Dam Ka,Klo Rose,Ali Nor,Ko
Murr,Kate Ismal, Ismal Dan,Dan Nas,Ami
I want to assign numbers to each person based on the columns, a sequence of numbers.
For example, in Name 1, we get the numbers from 1-4. The repeated names will get the same numbers.
In Name 2, it should be started from 5 and so on. This will give me the following table:
Assign1 Assian2 Assian3 Assian4
1 5 8 12
2 6 9 13
1 5 8 14
3 7 10 15
4 6 11 17
3 7 9 15
I would like to have it without a loop, i.e.,sapply,i.e., sapply(dat, function(x) match(x, unique(x))).
Using dplyr or tidyverse would be great.
A tidyverse solution with purrr::accumulate():
library(tidyverse)
df %>%
mutate(as_tibble(
accumulate(across(Name1:Name4, ~ match(.x, unique(.x))), ~ .y + max(.x))
))
# Name1 Name2 Name3 Name4
# 1 1 5 8 12
# 2 2 6 9 13
# 3 1 5 8 14
# 4 3 7 10 15
# 5 4 6 11 16
# 6 3 7 9 15
Because the values in each column depend on the values in the previous column, the calculations have to be done sequentially. This is probably most succinctly achieved by a loop. Remember that lapply and sapply are simply loops-in-disguise, and won't be quicker than an explicit loop.
Note that your expected output has a mistake in it (there is a number 17 which should be 16)
output <- setNames(df, paste0('Assign', seq_along(df)))
for(i in seq_along(output)) {
output[[i]] <- match(output[[i]], unique(output[[i]]))
if(i > 1) output[[i]] <- output[[i]] + max(output[[i - 1]])
}
output
#> Assign1 Assign2 Assign3 Assign4
#> 1 1 5 8 12
#> 2 2 6 9 13
#> 3 1 5 8 14
#> 4 3 7 10 15
#> 5 4 6 11 16
#> 6 3 7 9 15
Edit
If you really want it without an explicit loop, you can do:
res <- sapply(seq_along(df), \(i) match(df[[i]], unique(df[[i]])))
res + t(replicate(nrow(df), head(c(0, cumsum(apply(res, 2, max))), -1))) |>
as.data.frame() |>
setNames(paste0('Assign', seq_along(df)))
#> Assign1 Assign2 Assign3 Assign4
#> 1 1 5 8 12
#> 2 2 6 9 13
#> 3 1 5 8 14
#> 4 3 7 10 15
#> 5 4 6 11 16
#> 6 3 7 9 15
Created on 2023-01-13 with reprex v2.0.2
Data taken from question in reproducible format
df <- structure(list(Name1 = c("Rose,Ali", "Camp,Laura", "Rose,Ali",
"Murr,Kate", "Ghol,Dam", "Murr,Kate"), Name2 = c("Van,Hall",
"Ka,Klo", "Van,Hall", "Ismal, Ismal", "Ka,Klo", "Ismal, Ismal"
), Name3 = c("Ghol,Dam", "Dan,Dan", "Ghol,Dam", "Sian,Rozi",
"Rose,Ali", "Dan,Dan"), Name4 = c("Murr,kate", "Ali,Hoss", "Kol,Kan",
"Nas,Ami", "Nor,Ko", "Nas,Ami")), row.names = c(NA, -6L),
class = "data.frame")
Here is a tidyverse approach:
First paste the column name after each of the strings in all your columns, for sorting purpose later. Then pivot it into a two-column df so that we can assign ID to them by match. Finally pivot it back to a wide format and unnest the list columns.
library(tidyverse)
df %>%
mutate(across(everything(), ~ paste0(.x, "_", cur_column()))) %>%
pivot_longer(everything(), names_to = "ab", values_to = "a") %>%
arrange(ab) %>%
mutate(b = match(a, unique(a)), .keep = "unused") %>%
pivot_wider(names_from = "ab", values_from = "b") %>%
unnest(everything())
# A tibble: 6 × 4
Name1 Name2 Name3 Name4
<int> <int> <int> <int>
1 1 5 8 12
2 2 6 9 13
3 1 5 8 14
4 3 7 10 15
5 4 6 11 16
6 3 7 9 15
Data
Taken from #Allan Cameron.
df <- structure(list(Name1 = c("Rose,Ali", "Camp,Laura", "Rose,Ali",
"Murr,Kate", "Ghol,Dam", "Murr,Kate"), Name2 = c("Van,Hall",
"Ka,Klo", "Van,Hall", "Ismal, Ismal", "Ka,Klo", "Ismal, Ismal"
), Name3 = c("Ghol,Dam", "Dan,Dan", "Ghol,Dam", "Sian,Rozi",
"Rose,Ali", "Dan,Dan"), Name4 = c("Murr,kate", "Ali,Hoss", "Kol,Kan",
"Nas,Ami", "Nor,Ko", "Nas,Ami")), row.names = c(NA, -6L),
class = "data.frame")
Update: The approach below is not ideal because ID's are not unique. Sorry.
Using a lookup table with tidyverse:
library(dplyr)
library(tidyr)
lookup <-
df |>
pivot_longer(everything()) |>
distinct() |>
arrange(name) |>
transmute(name = value, value = row_number()) |>
deframe()
df |>
mutate(across(everything(), ~ recode(., !!!lookup)))
Output:
Name1 Name2 Name3 Name4
1 1 5 4 12
2 2 6 9 13
3 1 5 4 14
4 3 7 10 15
5 4 6 1 16
6 3 7 9 15
Data from #Allan Cameron, thanks.

Tidyverse column-wise differences

Suppose I have a data frame like this:
df = data.frame(preA = c(1,2,3),preB = c(3,4,5),postA = c(6,7,8),postB = c(9,8,4))
I want to add columns having column-wise differences, that is:
diffA = postA - preA
diffB = postB - preB
and so on...
Is there an efficient way to do this in tidyverse?
The way to go with dplyr and tidyr:
library(dplyr)
library(tidyr)
df %>%
mutate(id = 1:n()) %>%
pivot_longer(-id,
names_to = c("pre_post", ".value"),
names_pattern = "(pre|post)(.*)") %>%
group_by(id) %>%
mutate(across(A:B, diff, .names = "diff{col}")) %>%
pivot_wider(names_from = pre_post, values_from = c(A, B),
names_glue = '{pre_post}{.value}') %>%
select(id, starts_with("pre"), starts_with("post"), starts_with("diff"))
# id preA preB postA postB diffA diffB
# 1 1 1 3 6 9 5 6
# 2 2 2 4 7 8 5 4
# 3 3 3 5 8 4 5 -1
A shorter but less flexible was with dplyover::across2:
library(dplyr)
library(dplover)
df %>%
#relocate(sort(colnames(.))) %>%
mutate(across2(starts_with("post"), starts_with("pre"), `-`,
.names = "diff{idx}"))
# preA preB postA postB diff1 diff2
# 1 1 3 6 9 5 6
# 2 2 4 7 8 5 4
# 3 3 5 8 4 5 -1
You can do this with two uses of across(), creating new variables with the first use and subtracting the second. This also assumes your columns are in order.
df %>%
mutate(across(starts_with("post"), .names = "diff{sub('post', '', .col)}") - across(starts_with("pre")))
preA preB postA postB diffA diffB
1 1 3 6 9 5 6
2 2 4 7 8 5 4
3 3 5 8 4 5 -1
A few more solutions. My favourite is the first one demonstrated here - I think it's the cleanest and most debuggable:
# Setup:
library(dplyr, warn.conflicts = FALSE)
library(glue)
df <- data.frame(
preA = c(1,2,3),
preB = c(3,4,5),
postA = c(6,7,8),
postB = c(9,8,4)
)
Method 1: Using expressions:
This is my favourite approach. I think it's very readable, and I think it should be reasonably fast compared to solutions using across():
cols <- c("A", "B")
exprs <- glue("post{cols} - pre{cols}")
names(exprs) <- glue("diff{cols}")
df |>
mutate(!!!rlang::parse_exprs(exprs))
#> preA preB postA postB diffA diffB
#> 1 1 3 6 9 5 6
#> 2 2 4 7 8 5 4
#> 3 3 5 8 4 5 -1
Method 2: Using mutate() + across() + get():
Personally, I don't like this sort of thing because I think it's really hard to read:
df |>
mutate(across(
starts_with("post"),
~ .x - get(stringr::str_replace_all(cur_column(), "^post", "pre")),
.names = "diff{stringr::str_remove(.col, '^post')}"
))
#> preA preB postA postB diffA diffB
#> 1 1 3 6 9 5 6
#> 2 2 4 7 8 5 4
#> 3 3 5 8 4 5 -1
Method 3: Using base subsetting:
The main advantage here is that you don't need any packages (you can use paste0() instead of glue()), IMO it's also pretty readable. But I don't like that it doesn't play well with |>:
cols <- c("A", "B")
df2 <- df
df2[glue("diff{cols}")] <- df2[glue("post{cols}")] - df2[glue("pre{cols}")]
df2
#> preA preB postA postB diffA diffB
#> 1 1 3 6 9 5 6
#> 2 2 4 7 8 5 4
#> 3 3 5 8 4 5 -1

apply function or loop within mutate

Let's say I have a data frame. I would like to mutate new columns by subtracting each pair of the existing columns. There are rules in the matching columns. For example, in the below codes, the prefix is all same for the first component (base_g00) of the subtraction and the same for the second component (allow_m00). Also, the first component has numbers from 27 to 43 for the id and the second component's id is from 20 to 36 also can be interpreted as (1st_id-7). I am wondering for the following code, can I write in a apply function or loops within mutate format to make the codes simpler. Thanks so much for any suggestions in advance!
pred_error<-y07_13%>%mutate(annual_util_1=base_g0027-allow_m0020,
annual_util_2=base_g0028-allow_m0021,
annual_util_3=base_g0029-allow_m0022,
annual_util_4=base_g0030-allow_m0023,
annual_util_5=base_g0031-allow_m0024,
annual_util_6=base_g0032-allow_m0025,
annual_util_7=base_g0033-allow_m0026,
annual_util_8=base_g0034-allow_m0027,
annual_util_9=base_g0035-allow_m0028,
annual_util_10=base_g0036-allow_m0029,
annual_util_11=base_g0037-allow_m0030,
annual_util_12=base_g0038-allow_m0031,
annual_util_13=base_g0039-allow_m0032,
annual_util_14=base_g0040-allow_m0033,
annual_util_15=base_g0041-allow_m0034,
annual_util_16=base_g0042-allow_m0035,
annual_util_17=base_g0043-allow_m0036)
I think a more idiomatic tidyverse approach would be to reshape your data so those column groups are encoded as a variable instead of as separate columns which have the same semantic meaning.
For instance,
library(dplyr); library(tidyr); library(stringr)
y07_13 <- tibble(allow_m0021 = 1:5,
allow_m0022 = 2:6,
allow_m0023 = 11:15,
base_g0028 = 5,
base_g0029 = 3:7,
base_g0030 = 100)
y07_13 %>%
mutate(row = row_number()) %>%
pivot_longer(-row) %>%
mutate(type = str_extract(name, "allow_m|base_g"),
num = str_remove(name, type) %>% as.numeric(),
group = num - if_else(type == "allow_m", 20, 27)) %>%
select(row, type, group, value) %>%
pivot_wider(names_from = type, values_from = value) %>%
mutate(annual_util = base_g - allow_m)
Result
# A tibble: 15 x 5
row group allow_m base_g annual_util
<int> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 5 4
2 1 2 2 3 1
3 1 3 11 100 89
4 2 1 2 5 3
5 2 2 3 4 1
6 2 3 12 100 88
7 3 1 3 5 2
8 3 2 4 5 1
9 3 3 13 100 87
10 4 1 4 5 1
11 4 2 5 6 1
12 4 3 14 100 86
13 5 1 5 5 0
14 5 2 6 7 1
15 5 3 15 100 85
Here is vectorised base R approach -
base_cols <- paste0("base_g00", 27:43)
allow_cols <- paste0("allow_m00", 20:36)
new_cols <- paste0("annual_util", 1:17)
y07_13[new_cols] <- y07_13[base_cols] - y07_13[allow_cols]
y07_13

Use tidyr::pivot_longer for multiple measurements with uncertainties

A common type of data set I come across contains several measurements with associated uncertainties combined in each row. Here's an example:
structure(list(meas1 = c(150.3197, 19.95853, 161.40022, 103.23733, 140.28786, 193.42983, 75.237556, 207.84688, 116.4379, 80.251797 ), unc1 = c(0.038140954, 0.09151666, 0.035390881, 0.043274285, 0.03396304, 0.033362432, 0.05290015, 0.035449262, 0.038330437, 0.049171039), meas2 = c(1270.5522, 562.92518, 940.65152, 696.6982, 380.22449, 1979.0521, 1022.01, 1269.7508, 1686.6116, 1256.0033 ), unc2 = c(0.06063558, 0.061388181, 0.060714985, 0.061178737, 0.061318833, 0.060302475, 0.060876815, 0.060659146, 0.060412551, 0.060635459), meas3 = c(601.11331, 1675.2958, 608.84736, 998.76837, 266.2926, 2933.9751, 1682.3191, 775.43699, 428.29473, 1393.6564 ), unc3 = c(0.103445147, 0.102309634, 0.103147224, 0.101772166, 0.104186185, 0.101292496, 0.101556363, 0.102983978, 0.10394405, 0.101598249), ID = 1:10), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"))
I want to get it in a tidy configuration, like this:
ID meas_type reading uncert
1 1 meas1 150.31970 0.03814095
2 1 meas2 1270.55220 0.06063558
3 1 meas3 601.11331 0.10344515
4 2 meas1 19.95853 0.09151666
5 2 meas2 562.92518 0.06138818
6 2 meas3 1675.29580 0.10230963 ...
I have a work-around, but am wondering if there isn't a pivot_longer() method that would do this more elegantly.
Here's my klugey solution:
df_vals <- df_raw %>%
pivot_longer(cols = c("meas1", "meas2", "meas3"),
names_to = "meas_type",
values_to = "reading")
df_vals <- df_vals[, 4:6]
df_unc <- df_raw %>%
pivot_longer(cols = starts_with("unc"),
values_to = "uncert")
df_unc <- df_unc[, 4:6]
df <- cbind(df_vals, "uncert" = df_unc$uncert)
We can use names_pattern argument of pivot_longer.
tidyr::pivot_longer(df, cols = -ID,
names_to = c(".value", "meas_type"),
names_pattern = "(.*)(\\d+)")
# A tibble: 30 x 4
# ID meas_type meas unc
# <int> <chr> <dbl> <dbl>
# 1 1 1 150. 0.0381
# 2 1 2 1271. 0.0606
# 3 1 3 601. 0.103
# 4 2 1 20.0 0.0915
# 5 2 2 563. 0.0614
# 6 2 3 1675. 0.102
# 7 3 1 161. 0.0354
# 8 3 2 941. 0.0607
# 9 3 3 609. 0.103
#10 4 1 103. 0.0433
# … with 20 more rows
In case you would consider a base R solution, you would need to use a data frame not a tibble but this does what you want..
d <- as.data.frame(d)
reshape(data=d, varying=1:6,
timevar="meas_type",
direction="long",
sep="")
ID meas_type meas unc
1.1 1 1 150.31970 0.03814095
2.1 2 1 19.95853 0.09151666
3.1 3 1 161.40022 0.03539088
4.1 4 1 103.23733 0.04327429
5.1 5 1 140.28786 0.03396304
6.1 6 1 193.42983 0.03336243
We can use melt from data.table
library(data.table)
melt(setDT(df1), measure = patterns("^unc", "meas"),
value.name = c("unc", "meas"), variable.name = "meas_type")
# ID meas_type unc meas
# 1: 1 1 0.03814095 150.31970
# 2: 2 1 0.09151666 19.95853
# 3: 3 1 0.03539088 161.40022
# 4: 4 1 0.04327429 103.23733
# 5: 5 1 0.03396304 140.28786
# 6: 6 1 0.03336243 193.42983
# 7: 7 1 0.05290015 75.23756
# 8: 8 1 0.03544926 207.84688
# 9: 9 1 0.03833044 116.43790
#10: 10 1 0.04917104 80.25180
#11: 1 2 0.06063558 1270.55220
#...

Is it possible to dynamically mutate columns (values in a column have other column names)

So I have one column of a dataframe which contains a value, which is equal to a different column name. For each row, I want to change the value of the column that is named.
df <- tibble(.rows = 6) %>% mutate(current_stage = c("Stage-1", "Stage-1", "Stage-2", "Stage-3", "Stage-4", "Stage-4"), `Stage-1` = c(1,1,1,2,4,5), `Stage-2` = c(40,50,20,10,15,10), `Stage-3` = c(1,2,3,4,5,6), `Stage-4` = c(NA, 1, NA, 2, NA, 3))
A tibble: 6 x 5
current_stage `Stage-1` `Stage-2` `Stage-3` `Stage-4`
<chr> <dbl> <dbl> <dbl> <dbl>
Stage-1 1 40 1 NA
Stage-1 1 50 2 1
Stage-2 1 20 3 NA
Stage-3 2 10 4 2
Stage-4 4 15 5 NA
Stage-4 5 10 6 3
So in the first row, I would want to edit the value in the Stage-1 column because the current_stage column has Stage-1. I've tried using !!rlang::sym:
df %>% mutate(!!rlang::sym(current_stage) := 15)
but I get the error: Error in is_symbol(x) : object 'current_stage' not found.
Is this even possible to do? Or should I just bite the bullet and write a different function?
Within the tidyverse, I think using a long format with gather is the easiest way as suggested by Jack Brookes:
library(tidyverse)
df %>%
rowid_to_column() %>%
gather(stage, value, -current_stage, -rowid) %>%
mutate(value = if_else(stage == current_stage, 15, value)) %>%
spread(stage, value)
#> # A tibble: 6 x 6
#> rowid current_stage `Stage-1` `Stage-2` `Stage-3` `Stage-4`
#> <int> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 1 Stage-1 15 40 1 NA
#> 2 2 Stage-1 15 50 2 1
#> 3 3 Stage-2 1 15 3 NA
#> 4 4 Stage-3 2 10 15 2
#> 5 5 Stage-4 4 15 5 15
#> 6 6 Stage-4 5 10 6 15
Created on 2019-05-20 by the reprex package (v0.2.1)

Resources