pivot_longer into several pairs of columns - r

I need to pivot_longer across multiple groups of columns, creating multiple names--values pairs.
For instance, I need to go from something like this:
df_raw <- tribble(
~id, ~belief_dog, ~belief_bull_frog, ~belief_fish, ~age, ~norm_bull_frog, ~norm_fish, ~norm_dog, ~gender,
"b2x8", 1, 4, 3, 41, 4, 2, 10, 2,
"m89w", 3, 6, 2, 19, 1, 2, 3, 1,
"32x8", 1, 5, 2, 38, 9, 1, 8, 3
)
And turn it into something like this:
df_final <- tribble(
~id, ~belief_animal, ~belief_rating, ~norm_animal, ~norm_rating, ~age, ~gender,
"b2x8", "dog", 1, "bull_frog", 4, 41, 2,
"b2x8", "bull_frog", 4, "fish", 2, 41, 2,
"b2x8", "fish", 3, "dog", 10, 41, 2,
"m89w", "dog", 3, "bull_frog", 1, 19, 1,
"m89w", "bull_frog", 6, "fish", 2, 19, 1,
"m89w", "fish", 2, "dog", 3, 19, 1,
"32x8", "dog", 1, "bull_frog", 9, 38, 3,
"32x8", "bull_frog", 5, "fish", 1, 38, 3,
"32x8", "fish", 2, "dog", 8, 38, 3
)
In other words, anything starting with "belief_" should get pivoted in one names--values pair & anything starting with "norm_" should be pivoted into another names--values pair.
I tried looking at several other Stack Overflow pages with somewhat related content but wasn't able to translate those solutions to this situation.
Any help would be appreciated, with a strong preference for dplyr solutions.
THANKS!

With tidyverse, we can pivot on the two sets of columns that start with belief and norm. We can then use a regex to split each column name at the first underscore (since some column names contain multiple underscores). Essentially, the first part of the name (belief or norm) becomes its own value column (via .value), and the second part (the animal name) goes into a single column named animal. Note that this yields one animal column shared by both ratings, rather than the separate belief_animal and norm_animal columns shown in df_final.
library(tidyverse)
df_raw %>%
pivot_longer(cols = c(starts_with("belief"), starts_with("norm")),
names_to = c('.value', 'animal'),
names_pattern = '(.*?)_(.*)') %>%
rename(belief_rating = belief, norm_rating = norm)
Output
id age gender animal belief_rating norm_rating
<chr> <dbl> <dbl> <chr> <dbl> <dbl>
1 b2x8 41 2 dog 1 10
2 b2x8 41 2 bull_frog 4 4
3 b2x8 41 2 fish 3 2
4 m89w 19 1 dog 3 3
5 m89w 19 1 bull_frog 6 1
6 m89w 19 1 fish 2 2
7 32x8 38 3 dog 1 8
8 32x8 38 3 bull_frog 5 9
9 32x8 38 3 fish 2 1
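To see how the names_pattern regex divides each column name, the split can be reproduced outside pivot_longer. This is a minimal base R sketch; the cols vector and the use of regexec() with perl = TRUE are illustrative assumptions, not what tidyr runs internally.

```r
# Column names taken from df_raw
cols <- c("belief_dog", "belief_bull_frog", "norm_fish", "norm_bull_frog")

# Same pattern as in names_pattern: the lazy first group stops at the first underscore
m <- regmatches(cols, regexec("(.*?)_(.*)", cols, perl = TRUE))

# Each match is c(full match, group 1, group 2); bind them into a matrix
parts <- do.call(rbind, m)

value_col <- parts[, 2]  # becomes the name of an output column (.value)
animal    <- parts[, 3]  # becomes an entry in the animal column
print(data.frame(cols, value_col, animal))
```

Because the first group is lazy, "belief_bull_frog" splits into "belief" and "bull_frog" rather than "belief_bull" and "frog".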

Solved it with a bit more experimentation!
The key comes down to both the names_to & the names_pattern arguments.
df_raw %>% pivot_longer(
cols = c(belief_dog:belief_fish, norm_bull_frog:norm_dog),
names_to = c(".value", "rating"),
names_pattern = "([a-z]+)_*(.+)"
)
I don't really understand how ".value" or the regex "([a-z]+)_*(.+)" work, but the solution works nonetheless.
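As for the two pieces: in the regex, ([a-z]+) captures the leading run of lowercase letters (belief or norm), _* consumes the underscore that follows, and (.+) captures the rest of the name. The special name ".value" tells pivot_longer that the matching part of each column name should become the name of an output column holding the values, rather than an entry in a names column. A rough base R sketch of the same reshaping done by hand, on a cut-down version of df_raw (the small data frame and the helper name stack_one are illustrative assumptions):

```r
small <- data.frame(
  id = c("b2x8", "m89w"),
  belief_dog = c(1, 3), belief_fish = c(3, 2),
  norm_dog = c(10, 3), norm_fish = c(2, 2)
)

# For one animal, pull the belief_* and norm_* columns into columns
# named after the prefix, which is the role ".value" plays
stack_one <- function(animal) {
  data.frame(
    id     = small$id,
    rating = animal,  # second regex group: the names column
    belief = small[[paste0("belief_", animal)]],
    norm   = small[[paste0("norm_", animal)]]
  )
}

long <- do.call(rbind, lapply(c("dog", "fish"), stack_one))
long <- long[order(long$id), ]
print(long)
```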

For these data:
library(dplyr)
library(tidyr)
df_raw %>%
pivot_longer(
cols = -c(id, age, gender),
names_to = "name1",
values_to = "belief_rating"
) %>%
separate(name1, c("A", "B"), sep = '\\_' , extra = 'merge') %>%
group_by(id) %>%
mutate(helper = rep(row_number(), each=3, length.out = n())) %>%
pivot_wider(
names_from = A,
values_from = B,
names_glue = "{A}_animal"
) %>%
mutate(norm_rating = ifelse(helper == 1, lead(belief_rating, 3), NA),
norm_animal = ifelse(helper == 1, lead(norm_animal, 3), NA)) %>%
slice(1:3) %>%
select(id, belief_animal, belief_rating, norm_animal, norm_rating, age, gender)
id belief_animal belief_rating norm_animal norm_rating age gender
<chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 32x8 dog 1 bull_frog 9 38 3
2 32x8 bull_frog 5 fish 1 38 3
3 32x8 fish 2 dog 8 38 3
4 b2x8 dog 1 bull_frog 4 41 2
5 b2x8 bull_frog 4 fish 2 41 2
6 b2x8 fish 3 dog 10 41 2
7 m89w dog 3 bull_frog 1 19 1
8 m89w bull_frog 6 fish 2 19 1
9 m89w fish 2 dog 3 19 1

Related

How to pivot_longer with prefix to ID column

My data often contain either "Left/Right" or "Pre/Post" prefixes without separators in a wide format that I need to pivot to tall format, combining variables by these prefixes. I have a workaround of using "gsub()" to insert a separator ("_" or ".") into the column names; "pivot_longer" then does what I want with the "names_sep" argument. I'm wondering, though, whether there is a way to make this work more directly with the "pivot_longer" "names" syntax ("names_prefix", "names_pattern", "names_to"). Here is what I am attempting:
Original wide format example:
HW <- tribble(
~Subject, ~LeftV1, ~RightV1, ~LeftV2, ~RightV2, ~LeftV3, ~RightV3,
"A", 0, 1, 10, 11, 100, 101,
"B", 2, 3, 12, 13, 102, 103,
"C", 4, 5, 14, 15, 104, 105)
Desired tall format:
HWT <- tribble(
~Subject, ~Side, ~V1, ~V2, ~V3,
"A", "Left", 0, 10, 100,
"A", "Right", 1, 11, 101,
"B", "Left", 2, 12, 102,
"B", "Right", 3, 13, 103,
"C", "Left", 4, 14, 104,
"C", "Right", 5, 15, 105)
I've tried various iterations of syntax that look more or less like this:
HWT <- HW %>% pivot_longer(
cols = contains(c("Left", "Right")),
names_pattern = "/^(Left|Right)",
names_to = c('Side', '.value') )
or this:
HWT <- HW %>% pivot_longer(
cols = contains(c("Left", "Right")),
names_prefix = "/^(Left|Right)",
names_to = c('Side', '.value') )
Each of which give syntax errors that I am unsure how to resolve.
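We could use names_pattern with two capture groups: the first (Left or Right) goes to Side, and the remainder of the name becomes the output column via .value.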
library(tidyr)
library(dplyr)
HW %>%
pivot_longer(cols = -Subject, names_to = c("Side", ".value"),
names_pattern = "^(Left|Right)(.*)")
# A tibble: 6 × 5
Subject Side V1 V2 V3
<chr> <chr> <dbl> <dbl> <dbl>
1 A Left 0 10 100
2 A Right 1 11 101
3 B Left 2 12 102
4 B Right 3 13 103
5 C Left 4 14 104
6 C Right 5 15 105
Here is a similar approach that also uses pivot_longer but with a different strategy. I find it easier to understand with a simple separator such as _. To get one, we can use rename_with and str_replace_all before pivoting:
library(dplyr)
library(tidyr)
library(stringr)
HW %>%
rename_with(., ~str_replace_all(., 'V', '_V')) %>%
pivot_longer(-Subject,
names_to =c("Side", ".value"),
names_sep ="_")
# A tibble: 6 x 5
Subject Side V1 V2 V3
<chr> <chr> <dbl> <dbl> <dbl>
1 A Left 0 10 100
2 A Right 1 11 101
3 B Left 2 12 102
4 B Right 3 13 103
5 C Left 4 14 104
6 C Right 5 15 105
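The gsub() workaround mentioned in the question can also be anchored on the prefix itself rather than on the letter V, which may be safer when variable names vary. A minimal base R sketch, assuming the prefixes are exactly Left and Right (the old_names vector is an illustrative stand-in for names(HW)):

```r
old_names <- c("Subject", "LeftV1", "RightV1", "LeftV2", "RightV2")

# Insert an underscore right after a leading Left or Right;
# names without the prefix (Subject) are left untouched
new_names <- sub("^(Left|Right)", "\\1_", old_names)
print(new_names)
```

After renaming, names_sep = "_" can split each name into Side and the variable part.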

manipulate a pair data in R

I would like to reshape the sample data below to get the output shown in the table. How can I do that? The idea is to split column e into two columns according to disease: rows with disease 0 in one column and rows with disease 1 in the other. Thanks in advance.
df <- structure(list(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), fid = c(1,
1, 2, 2, 3, 3, 4, 4, 5, 5), disease = c(0, 1, 0, 1, 1, 0, 1, 0, 0,
1), e = c(3, 2, 6, 1, 2, 5, 2, 3, 1, 1)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -10L))
library(tidyverse)
df %>%
pivot_wider(id_cols = fid, names_from = disease, values_from = e, names_prefix = 'e') %>%
select(-fid)
e0 e1
<dbl> <dbl>
1 3 2
2 6 1
3 5 2
4 3 2
5 1 1
If you want the columns named e1 and e2 instead, you could do:
df %>%
pivot_wider(id_cols = fid, names_from = disease, values_from = e,
names_glue = 'e{disease + 1}') %>%
select(-fid)
# A tibble: 5 x 2
e1 e2
<dbl> <dbl>
1 3 2
2 6 1
3 5 2
4 3 2
5 1 1
We could use lead() combined with ifelse statements for this:
library(dplyr)
df %>%
mutate(e2 = lead(e)) %>%
filter(row_number() %% 2 == 1) %>%
mutate(e1 = ifelse(disease==1, e2,e),
e2 = ifelse(disease==0, e2,e)) %>%
select(e1, e2)
e1 e2
<dbl> <dbl>
1 3 2
2 6 1
3 5 2
4 3 2
5 1 1
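For comparison, the same split can be sketched in base R. This assumes every fid has exactly one row with disease 0 and one with disease 1, as in the sample data; the data frame is rebuilt here so the block runs on its own.

```r
df <- data.frame(
  id = 1:10,
  fid = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5),
  disease = c(0, 1, 0, 1, 1, 0, 1, 0, 0, 1),
  e = c(3, 2, 6, 1, 2, 5, 2, 3, 1, 1)
)

# Order by pair (fid), then disease, so both subsets line up by fid
df <- df[order(df$fid, df$disease), ]

out <- data.frame(
  e0 = df$e[df$disease == 0],
  e1 = df$e[df$disease == 1]
)
print(out)
```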

dplyr solution: absolute difference of two values in one column matched by other column

I have a dataframe that looks like this, but there will be many more IDs:
# Groups: ID [1]
ID ARS stim
<int> <int> <chr>
1 3 0 1
2 3 4 2
3 3 2 3
4 3 3 4
5 3 1 5
6 3 0 6
7 3 2 10
8 3 4 11
9 3 0 12
10 3 3 13
11 3 2 14
12 3 2 15
I would like to calculate the sum of the absolute difference abs() between the values in ARS, e.g. for stim=1 and stim=10 plus for stim=2 and stim=11 and so on.
Any good solutions are appreciated!
The desired output calculation is:
abs(0-2) + abs(4-4) + abs(2-0) + abs(3-3) + abs(1-2) + abs(0-2)
Hence, 2+0+2+0+1+2
Output for ID==3: 7
A possible solution:
library(dplyr)
df <- structure(list(ID = c(3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3), ARS = c(0, 4, 2, 3, 1, 0, 2, 4, 0, 3, 2, 2), stim = c(1, 2, 3, 4, 5, 6,
10, 11, 12, 13, 14, 15)), row.names = c(NA, -12L), class = "data.frame")
df %>%
group_by(ID) %>%
# %in% avoids the silent recycling in stim == 1:6; rows are assumed sorted by stim within each ID
summarise(value = sum(abs(ARS[stim %in% 1:6] - ARS[stim %in% (9 + 1:6)])),
.groups = "drop") %>%
pull(value)
#> [1] 7
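If the rows are not guaranteed to be sorted by stim, the partner of each stimulus can be looked up explicitly. A minimal base R sketch, assuming each stimulus k in 1:6 is paired with stimulus k + 9 (the helper name sum_abs_diff and its offset argument are hypothetical):

```r
df <- data.frame(
  ID = 3,
  ARS = c(0, 4, 2, 3, 1, 0, 2, 4, 0, 3, 2, 2),
  stim = c(1:6, 10:15)
)

sum_abs_diff <- function(d, offset = 9) {
  first   <- d[d$stim %in% 1:6, ]
  partner <- match(first$stim + offset, d$stim)  # row index of the paired stimulus
  sum(abs(first$ARS - d$ARS[partner]))
}

# One result per ID
result <- sapply(split(df, df$ID), sum_abs_diff)
print(result)
```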

Is there a way to create new columns in R based on manipulations from multiple data frames?

Does anyone know if it is possible to use a variable in one dataframe (in my case the "deploy" dataframe) to create a variable in another dataframe?
For example, I have two dataframes:
df1:
deploy <- data.frame(ID = c("20180101_HH1_1_1", "20180101_HH1_1_2", "20180101_HH1_1_3"),
Site_Depth = c(42, 93, 40), Num_Depth_Bins_Required = c(5, 100, 4),
Percent_Column_in_each_bin = c(20, 10, 25))
df2:
sp.c <- data.frame(species = c("RR", "GS", "GT", "BR", "RS", "BA", "GS", "RS", "SH", "RR"),
ct = c(25, 66, 1, 12, 30, 6, 1, 22, 500, 6),
percent_dist_from_surf = c(11, 15, 33, 68, 71, 100, 2, 65, 5, 42))
I want to create new columns in df2 that assigns each species and count to a bin based on the Percent_Column_in_each_bin for each ID. For example, in 20180101_HH1_1_3 there would be 4 bins that each make up 25% of the column and all species that are within 0-25% of the column (in df2) would be in bin 1 and species within 25-50% of the column would be in depth bin 2, and so on. What I'm imagining this looking like is:
i.want.this <- data.frame(species = c("RR", "GS", "GT", "BR", "RS", "BA", "GS", "RS", "SH", "RR"),
ct = c(25, 66, 1, 12, 30, 6, 1, 22, 500, 6),
percent_dist_from_surf = c(11, 15, 33, 68, 71, 100, 2, 65, 5, 42),
'20180101_HH1_1_1_Bin' = c(1, 1, 2, 4, 4, 5, 1, 4, 1, 3),
'20180101_HH1_1_2_Bin' = c(2, 2, 4, 7, 8, 10, 1, 7, 1, 5),
'20180101_HH1_1_3_Bin' = c(1, 1, 2, 3, 3, 4, 1, 3, 1, 2))
I am pretty new to R and I'm not sure how to make this happen. I need to do this for over 100 IDs (all with different depths, number of depth bins, and percent of the column in each bin) so I was hoping that I don't need to do them all by hand. I have tried mutate in dplyr but I can't get it to pull from two different dataframes. I have also tried ifelse statements, but I would need to run the ifelse statement for each ID individually.
I don't know if what I am trying to do is possible but I appreciate the feedback. Thank you in advance!
Edit: my end goal is to find the max count (max ct) for each species within each bin for each ID. What I've been doing to find this (using the bins generated with suggestions from @Ben) is using dplyr to slice and find the max count like this:
`20180101_HH1_1_1` <- sp.c %>%
group_by(`20180101_HH1_1_1`, species) %>%
arrange(desc(ct)) %>%
slice(1) %>%
group_by(`20180101_HH1_1_1`) %>%
mutate(Count_Total_Per_Bin = sum(ct)) %>%
group_by(species, .add = TRUE) %>%
mutate(species_percent_of_total_in_bin =
paste0(100*ct/Count_Total_Per_Bin)) %>%
mutate(ID = "20180101_HH1_1_1") %>%
ungroup()
but I have to do this for over 100 IDs. My desired output would be something like:
end.goal <- data.frame(ID = c(rep("20180101_HH1_1_1", 8)),
species = c("RR", "GS", "SH", "GT", "RR", "BR", "RS", "BA"),
bin = c(1, 1, 1, 2, 3, 4, 4, 5),
Max_count_of_each_species_in_each_bin = c(11, 66, 500, 1, 6, 12, 30, 6),
percent_dist_from_surf = c(11, 15, 5, 33, 42, 68, 71, 100),
percent_each_species_max_in_each_bin = c((11/577)*100, (66/577)*100, (500/577)*100, 100, 100, (12/42)*100, (30/42)*100, 100))
I was thinking that by answering the original question I could get to this but I see now that there's still a lot you have to do to get this for each ID.
Here is another approach, which does not require a loop.
Using sapply over the Percent_Column_in_each_bin values in your deploy dataframe, you can use cut to assign each percent_dist_from_surf value in sp.c to a bin.
res <- sapply(deploy$Percent_Column_in_each_bin, function(x) {
cut(sp.c$percent_dist_from_surf, seq(0, 100, by = x), include.lowest = TRUE, labels = 1:(100/x))
})
colnames(res) <- deploy$ID
cbind(sp.c, res)
Or using purrr:
library(purrr)
cbind(sp.c, imap(setNames(deploy$Percent_Column_in_each_bin, deploy$ID),
~ cut(sp.c$percent_dist_from_surf, seq(0, 100, by = .x), include.lowest = TRUE, labels = 1:(100/.x))
))
Output
species ct percent_dist_from_surf 20180101_HH1_1_1 20180101_HH1_1_2 20180101_HH1_1_3
1 RR 25 11 1 2 1
2 GS 66 15 1 2 1
3 GT 1 33 2 4 2
4 BR 12 68 4 7 3
5 RS 30 71 4 8 3
6 BA 6 100 5 10 4
7 GS 1 2 1 1 1
8 RS 22 65 4 7 3
9 SH 500 5 1 1 1
10 RR 6 42 3 5 2
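To make the cut step concrete, here is a small standalone sketch for a single bin width. The pct and width values are made up for illustration; the cut() arguments mirror the ones used above.

```r
pct   <- c(11, 33, 100, 2, 50)  # percent distance from the surface
width <- 25                     # each bin covers 25% of the column

# Breaks at 0, 25, 50, 75, 100; include.lowest puts a value of exactly 0 in bin 1
bins <- cut(pct, breaks = seq(0, 100, by = width),
            include.lowest = TRUE, labels = 1:(100 / width))

# cut() returns a factor; convert to integers if numeric bins are needed
bin_number <- as.integer(as.character(bins))
print(bin_number)
```

Intervals are closed on the right, so a value of exactly 50 falls in bin 2 rather than bin 3.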
Edit:
To determine the maximum ct value for each species, site, and bin, store the combined result from above (the output of cbind) in a dataframe called res and do the following.
First, reshape it into long form with pivot_longer. Then you can group_by species, site, and bin, and determine the maximum ct for each combination.
library(tidyverse)
res %>%
pivot_longer(cols = starts_with("2018"), names_to = "site", values_to = "bin") %>%
group_by(species, site, bin) %>%
summarise(max_ct = max(ct)) %>%
arrange(site, bin)
Output
# A tibble: 26 x 4
# Groups: species, site [21]
species site bin max_ct
<fct> <chr> <fct> <dbl>
1 GS 20180101_HH1_1_1 1 66
2 RR 20180101_HH1_1_1 1 25
3 SH 20180101_HH1_1_1 1 500
4 GT 20180101_HH1_1_1 2 1
5 RR 20180101_HH1_1_1 3 6
6 BR 20180101_HH1_1_1 4 12
7 RS 20180101_HH1_1_1 4 30
8 BA 20180101_HH1_1_1 5 6
9 GS 20180101_HH1_1_2 1 1
10 SH 20180101_HH1_1_2 1 500
11 GS 20180101_HH1_1_2 2 66
12 RR 20180101_HH1_1_2 2 25
13 GT 20180101_HH1_1_2 4 1
14 RR 20180101_HH1_1_2 5 6
15 BR 20180101_HH1_1_2 7 12
16 RS 20180101_HH1_1_2 7 22
17 RS 20180101_HH1_1_2 8 30
18 BA 20180101_HH1_1_2 10 6
19 GS 20180101_HH1_1_3 1 66
20 RR 20180101_HH1_1_3 1 25
21 SH 20180101_HH1_1_3 1 500
22 GT 20180101_HH1_1_3 2 1
23 RR 20180101_HH1_1_3 2 6
24 BR 20180101_HH1_1_3 3 12
25 RS 20180101_HH1_1_3 3 30
26 BA 20180101_HH1_1_3 4 6
It is helpful to distinguish between the contents of your two dataframes.
df2 appears to contain measurements from some sites
df1 appears to contain parameters by which you want to process/summarise the measurements in df2
Given these different purposes of the two dataframes, your best approach is probably to loop over all the rows of df1, each time adding a column to df2. Something like the following:
library(dplyr)
max_dist = max(df2$percent_dist_from_surf)
for(ii in 1:nrow(df1)){
# extract parameters
this_ID = df1[[ii,"ID"]]
this_depth = df1[[ii,"Site_Depth"]]
this_bins = df1[[ii,"Num_Depth_Bins_Required"]]
this_percent = df1[[ii,"Percent_Column_in_each_bin"]]
# add column to df2
df2 = df2 %>%
mutate(!!sym(this_ID) := insert_your_calculation_here)
}
The !!sym(this_ID) := part of the code is to allow dynamic naming of your output columns.
And as best I can determine, the formula you want for insert_your_calculation_here is ceiling(percent_dist_from_surf / max_dist * this_bins)
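A runnable base R sketch of this loop on the sample data follows. Note the assumptions: it divides by Percent_Column_in_each_bin instead of multiplying by this_bins, because that variant reproduces the i.want.this columns from the question for this sample, and it assigns columns with [[<- rather than mutate so that no packages are needed.

```r
deploy <- data.frame(
  ID = c("20180101_HH1_1_1", "20180101_HH1_1_2", "20180101_HH1_1_3"),
  Percent_Column_in_each_bin = c(20, 10, 25)
)
sp.c <- data.frame(
  species = c("RR", "GS", "GT", "BR", "RS", "BA", "GS", "RS", "SH", "RR"),
  percent_dist_from_surf = c(11, 15, 33, 68, 71, 100, 2, 65, 5, 42)
)

for (ii in seq_len(nrow(deploy))) {
  this_ID      <- as.character(deploy$ID[ii])
  this_percent <- deploy$Percent_Column_in_each_bin[ii]
  # Assumed formula: bin = percent / bin width, rounded up;
  # pmax() keeps a value of exactly 0% in bin 1
  sp.c[[this_ID]] <- pmax(1, ceiling(sp.c$percent_dist_from_surf / this_percent))
}
print(sp.c)
```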

How to pivot_longer a set of multiple columns? and How to go back from that long format to original wide?

If I have the following data:
D = tibble::tribble(
~firm, ~ind, ~var1_1, ~var1_2, ~op2_1, ~op2_2,
"A", 1, 10, 11, 11, 12,
"A", 2, 12, 13, 13, 14,
"B", 1, 14, 15, 15, 16,
"B", 2, 16, 17, 17, 18,
"C", 1, 18, 19, 19, 20,
"C", 2, 20, 21, 21, 22,
)
How can I pivot_longer() var1 and var2 having "_*" as year indicator?
I mean, I would like have something like this:
D %>%
pivot_longer(var1_1:op2_2,
names_to = c(".value", "year"),
names_pattern = "(.*)_(.*)",
values_to = c("var1, var2")
)
# A tibble: 12 x 5
firm ind year var1 op2
<chr> <dbl> <chr> <dbl> <dbl>
1 A 1 1 10 11
2 A 1 2 11 12
3 A 2 1 12 13
4 A 2 2 13 14
5 B 1 1 14 15
6 B 1 2 15 16
7 B 2 1 16 17
8 B 2 2 17 18
9 C 1 1 18 19
10 C 1 2 19 20
11 C 2 1 20 21
12 C 2 2 21 22
I'm achieving the desired result using the code above. However in my real case I'm dealing with more than 30 variables and 10 years. Then, using values_to isn't practical and clean. I'd like the code read first part of variable name as the desired new variable name. Since initially all columns to be pivoted are structured like "varname_year".
Besides, once I get the new data format into long, I might need to go back to wide-format keeping the initial data structure.
We can select the columns with one of the select helpers. Here matches() picks every column whose name ends in an underscore followed by digits, so both the var1 and op2 columns are pivoted.
library(dplyr)
library(tidyr)
Dlong <- D %>%
pivot_longer(cols = matches("_[0-9]+$"),
names_to = c(".value", "year"), names_sep = "_")
From the 'long' format, change back to 'wide' with pivot_wider
Dlong %>%
pivot_wider(names_from = year, values_from = c(var1, op2))
