Issues with accent when using the "separate" function from tidyverse - r

I am using the separate function from tidyverse to split the first column of this tibble :
# A tibble: 6,951 x 9
Row.names Number_of_analysis~ DL_Minimum DL_Mean DL_Maximum Number_of_measur~ Measure_Minimum Measure_Mean Measure_Maximum
<I<chr>> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2011.FACILITY.PONT-À-CELLES 52 0.6 1.81 16 0 0 0 0
2 2011.FACILITY.PONT-À-CELLES 52 0.07 0.177 1.3 0 0 0 0
3 2011.FACILITY.CHARLEROI 52 0.07 0.212 1.9 0 0 0 0
4 2011.FACILITY.CHARLEROI 52 0.08 0.209 2 0 0 0 0
Merge_splitnames <- Merge %>%
separate(Row.names,sep = "\\.",into = c("Year", "Catchment", "Locality"), extra = "drop")
While everything seems correct, the output is a tibble without the first 2 rows (the ones whose locality name contains an accented character):
# A tibble: 6,951 x 9
Year Catchment Locality Number_of_analysis~ DL_Minimum DL_Mean DL_Maximum Number_of_measur~ Measure_Minimum Measure_Mean Measure_Maximum
<I<chr>> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
3 2011 FACILITY CHARLEROI 52 0.07 0.212 1.9 0 0 0 0
4 2011 FACILITY CHARLEROI 52 0.08 0.209 2 0 0 0 0
Any idea how to deal with this issue? I want to keep the real French name (with the accent). This surprises me, as I have never had this problem with any other tidyverse function.
NB: this is a simple, reproducible example; my real tibble is about 100 times bigger.

separate is retaining the accent for me:
library(tidyverse)
tribble(
~names,
"2011.FACILITY.PONT-À-CELLES",
"2011.FACILITY.PONT-À-CELLES",
"2011.FACILITY.CHARLEROI",
"2011.FACILITY.CHARLEROI"
) %>%
separate(names, sep = "\\.", into = c("Year", "Catchment", "Locality"))
#> # A tibble: 4 × 3
#> Year Catchment Locality
#> <chr> <chr> <chr>
#> 1 2011 FACILITY PONT-À-CELLES
#> 2 2011 FACILITY PONT-À-CELLES
#> 3 2011 FACILITY CHARLEROI
#> 4 2011 FACILITY CHARLEROI
Created on 2022-05-06 by the reprex package (v2.0.1)

Assuming DF is as shown reproducibly in the Note at the end, use extra = "merge" in separate. (You may need to change your locale, although I did not need to. Some things to try are shown in "How to change the locale of R?" and "Using weekdays with any locale under Windows".)
library(tidyr)
DF %>%
separate(Row.names, c("Year", "Catchment", "Locality"), extra = "merge")
giving:
Year Catchment Locality Number_of_analysis~ DL_Minimum DL_Mean
1 2011 FACILITY PONT-À-CELLES 52 0.60 1.810
2 2011 FACILITY PONT-À-CELLES 52 0.07 0.177
3 2011 FACILITY CHARLEROI 52 0.07 0.212
4 2011 FACILITY CHARLEROI 52 0.08 0.209
DL_Maximum Number_of_measur~ Measure_Minimum Measure_Mean Measure_Maximum
1 16.0 0 0 0 0
2 1.3 0 0 0 0
3 1.9 0 0 0 0
4 2.0 0 0 0 0
Note
DF <-
structure(list(Row.names = c("2011.FACILITY.PONT-À-CELLES", "2011.FACILITY.PONT-À-CELLES",
"2011.FACILITY.CHARLEROI", "2011.FACILITY.CHARLEROI"), `Number_of_analysis~` = c(52L,
52L, 52L, 52L), DL_Minimum = c(0.6, 0.07, 0.07, 0.08), DL_Mean = c(1.81,
0.177, 0.212, 0.209), DL_Maximum = c(16, 1.3, 1.9, 2), `Number_of_measur~` = c(0L,
0L, 0L, 0L), Measure_Minimum = c(0L, 0L, 0L, 0L), Measure_Mean = c(0L,
0L, 0L, 0L), Measure_Maximum = c(0L, 0L, 0L, 0L)), class = "data.frame", row.names = c("1",
"2", "3", "4"))
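If rows with accented names still go missing after trying the above, one thing worth ruling out is the class and encoding of the column: the question's tibble prints Row.names as <I<chr>>, i.e. an AsIs character vector. The sketch below is only an assumption about a possible cause, not a confirmed diagnosis. It converts the column to a plain UTF-8 character vector before splitting. The two-row Merge data frame is made up for illustration.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the question's data (hypothetical, two rows only)
Merge <- data.frame(
  Row.names = I(c("2011.FACILITY.PONT-À-CELLES", "2011.FACILITY.CHARLEROI")),
  DL_Mean = c(1.81, 0.212)
)

Merge_splitnames <- Merge %>%
  # drop the AsIs class and mark the strings as UTF-8 before splitting
  mutate(Row.names = enc2utf8(as.character(Row.names))) %>%
  separate(Row.names, into = c("Year", "Catchment", "Locality"),
           sep = "\\.", extra = "merge")

Merge_splitnames
```

If this changes nothing on your machine, the locale suggestions linked above are the next thing to try.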

Related

A way to indicate all possible Likert response options for a particular column so that those not used have a 0 by them using pivot longer in R?

I have numerous likert-type questions in my data and am using pivot longer to get percentages of how often each option is used. For some questions, however, certain options are never indicated by a respondent (e.g., they never answered with a 1). However, I would still like to see each possible response for each item with a 0/0% if it wasn't used. For instance, let's say I have a data frame d1.
names(d1)
"Course" "likert_1" "likert_2" "likert_3" "likert_4"
d1_long <- d1 %>%
pivot_longer(-Course, names_to = "items", values_to = "val") %>%
group_by(items, Course) %>%
mutate(N = sum(!is.na(val)),
val = as.character(val)) %>%
group_by(val, .add = TRUE) %>%
summarise(n = n(),
percent = round((n/N), digits = 2)) %>%
distinct()
head(d1_long)
# A tibble: 6 × 5
# Groups: items, Course, val [6]
items Course val n percent
<chr> <chr> <chr> <int> <dbl>
1 likert_1 A765 2 2 0.04
2 likert_1 A765 3 1 0.02
3 likert_1 A765 4 50 0.88
4 likert_1 B768 1 2 0.04
5 likert_1 B768 3 24 0.48
6 likert_1 B768 4 26 0.52
So, we can see that response option 1 wasn't used in course "A765", and option 2 wasn't used in course B768. What I am hoping to see is something like this:
head(d1_long)
# A tibble: 6 × 5
# Groups: items, Course, val [6]
items Course val n percent
<chr> <chr> <chr> <int> <dbl>
1 likert_1 A765 1 0 0.00
2 likert_1 A765 2 2 0.04
3 likert_1 A765 3 1 0.02
4 likert_1 A765 4 50 0.88
4 likert_1 B768 1 2 0.04
5 likert_1 B768 2 0 0.00
6 likert_1 B768 3 24 0.48
Any help is greatly appreciated- thanks!
Edited:
dput(d1_long)
structure(list(items = c("likert_1", "likert_1", "likert_1",
"likert_1", "likert_1", "likert_1"), Course = c("A765", "A765",
"A765", "B768", "B768", "B768"), val = c(2L, 3L, 4L, 1L, 3L,
4L), n = c(2L, 1L, 50L, 2L, 24L, 26L), percent = c(0.04, 0.02,
0.88, 0.04, 0.48, 0.52)), class = c("grouped_df", "tbl_df", "tbl",
"data.frame"), row.names = c(NA, -6L), groups = structure(list(
items = c("likert_1", "likert_1", "likert_1", "likert_1",
"likert_1", "likert_1"), Course = c("A765", "A765", "A765",
"B768", "B768", "B768"), val = c(2L, 3L, 4L, 1L, 3L, 4L),
.rows = structure(list(1L, 2L, 3L, 4L, 5L, 6L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -6L), .drop = TRUE))
Edit 2: I should have noted -- not all items have the same response scheme. For instance, some are 1-5 others are 1-7. Thanks
Here is a way. Group by items and Course, then complete based on a vector of all possible responses. Columns n and percent are filled with zeros (the default is NA).
suppressPackageStartupMessages(library(tidyverse))
all_possible_resp <- 1:4
d1_long %>%
ungroup() %>%
group_by(items, Course) %>%
complete(val = all_possible_resp,
fill = list(n = 0, percent = 0)) %>%
ungroup()
#> # A tibble: 8 × 5
#> items Course val n percent
#> <chr> <chr> <int> <int> <dbl>
#> 1 likert_1 A765 1 0 0
#> 2 likert_1 A765 2 2 0.04
#> 3 likert_1 A765 3 1 0.02
#> 4 likert_1 A765 4 50 0.88
#> 5 likert_1 B768 1 2 0.04
#> 6 likert_1 B768 2 0 0
#> 7 likert_1 B768 3 24 0.48
#> 8 likert_1 B768 4 26 0.52
Created on 2022-06-22 by the reprex package (v2.0.1)
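The answer above uses one response range (1:4) for every item, whereas Edit 2 says the scales differ by item. A minimal sketch of one way to handle that, assuming you can supply the highest option per item yourself: the scale_max lookup and the small counts tibble below are hypothetical and not taken from the question. The idea is to build the full grid of item/course/option combinations and join the observed counts onto it.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical lookup: highest response option for each item
scale_max <- c(likert_1 = 4, likert_2 = 5)

# Hypothetical observed counts (same shape as d1_long, without percent)
counts <- tibble(
  items  = c("likert_1", "likert_1", "likert_2"),
  Course = c("A765", "A765", "A765"),
  val    = c(2L, 4L, 5L),
  n      = c(2L, 50L, 10L)
)

# Every item/course pair crossed with all options valid for that item
grid <- counts %>%
  distinct(items, Course) %>%
  mutate(val = map(items, ~ seq_len(scale_max[[.x]]))) %>%
  unnest(val)

# Unused options get n = 0 instead of NA
filled <- grid %>%
  left_join(counts, by = c("items", "Course", "val")) %>%
  mutate(n = coalesce(n, 0L))

filled
```

A percent column can be filled with 0 in the same way, using coalesce(percent, 0).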

Combine list elements into a dataframe r

I currently have a list with columns as individual elements.
I would like to combine list elements with the same column names (i.e. bind rows) and merge across the different columns (i.e. bind columns) into a single data frame. I'm having difficulty finding examples of how to do this.
l = list(est = c(0, 0.062220390087795, 1.1020213968139, 0.0359939361491544
), se = c(0.0737200634874046, 0.237735179934829, 0.18105632705918,
0.111359438298789), rf = structure(c(NA, NA, NA, 4L), levels = c("Never\nsmoker",
"Occasional\nsmoker", "Ex-regular\nsmoker", "Smoker"), class = "factor"),
n = c(187L, 18L, 32L, 82L), model = c("Crude", "Crude", "Crude",
"Crude"), est = c(0, 0.112335510453586, 0.867095253670329,
0.144963556944891), se = c(0.163523775933409, 0.237039485900481,
0.186247776987999, 0.119887623484768), rf = structure(c(NA,
NA, NA, 4L), levels = c("Never\nsmoker", "Occasional\nsmoker",
"Ex-regular\nsmoker", "Smoker"), class = "factor"), n = c(187L,
18L, 32L, 82L), model = c("Model 1", "Model 1", "Model 1",
"Model 1"), est = c(0, 0.107097305324242, 0.8278765140371,
0.0958220447859447), se = c(0.164787596943329, 0.237347836229364,
0.187201880036661, 0.120882616647714), rf = structure(c(NA,
NA, NA, 4L), levels = c("Never\nsmoker", "Occasional\nsmoker",
"Ex-regular\nsmoker", "Smoker"), class = "factor"), n = c(187L,
18L, 32L, 82L), model = c("Model 2", "Model 2", "Model 2",
"Model 2"))
I would like the data to have the following format:
data.frame(
est = c(),
se = c(),
rf = c(),
model = c()
)
Any help would be appreciated. Thank you!
In this solution, first the elements of l are grouped by name and then are combined using c. Finally, the resulting list is converted to a dataframe using map_dfc.
library(dplyr)
library(purrr)
cols <- c("est", "se", "rf", "model")
setNames(cols,cols) |>
map(~l[names(l) == .x]) |>
map_dfc(~do.call(c, .x))
#> # A tibble: 12 × 4
#> est se rf model
#> <dbl> <dbl> <fct> <chr>
#> 1 0 0.0737 NA Crude
#> 2 0.0622 0.238 NA Crude
#> 3 1.10 0.181 NA Crude
#> 4 0.0360 0.111 Smoker Crude
#> 5 0 0.164 NA Model 1
#> 6 0.112 0.237 NA Model 1
#> 7 0.867 0.186 NA Model 1
#> 8 0.145 0.120 Smoker Model 1
#> 9 0 0.165 NA Model 2
#> 10 0.107 0.237 NA Model 2
#> 11 0.828 0.187 NA Model 2
#> 12 0.0958 0.121 Smoker Model 2
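Another option is to split the list into consecutive chunks of five elements (one chunk per model) and row-bind them: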
library(purrr)
grp <- (seq(length(l)) - 1) %/% 5
l_split <- split(l, grp)
map_df(l_split, c)
#> # A tibble: 12 × 5
#> est se rf n model
#> <dbl> <dbl> <fct> <int> <chr>
#> 1 0 0.0737 <NA> 187 Crude
#> 2 0.0622 0.238 <NA> 18 Crude
#> 3 1.10 0.181 <NA> 32 Crude
#> 4 0.0360 0.111 Smoker 82 Crude
#> 5 0 0.164 <NA> 187 Model 1
#> 6 0.112 0.237 <NA> 18 Model 1
#> 7 0.867 0.186 <NA> 32 Model 1
#> 8 0.145 0.120 Smoker 82 Model 1
#> 9 0 0.165 <NA> 187 Model 2
#> 10 0.107 0.237 <NA> 18 Model 2
#> 11 0.828 0.187 <NA> 32 Model 2
#> 12 0.0958 0.121 Smoker 82 Model 2
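The group-by-name idea from the first answer can also be sketched in base R, without purrr. The small list below is a made-up stand-in for l (it has no factor columns), so treat this as an illustration of the technique rather than a drop-in replacement.

```r
# Toy list with repeated element names (hypothetical data)
l_toy <- list(
  est = c(0, 1), model = c("Crude", "Crude"),
  est = c(2, 3), model = c("Model 1", "Model 1")
)

# Group the elements by name, then concatenate each group into one column
by_name <- split(l_toy, names(l_toy))
cols <- lapply(by_name, function(x) unlist(unname(x)))

result <- as.data.frame(cols)
result
```

split() orders the groups alphabetically by name, so reorder the columns afterwards if the original order matters.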

How can I make a moving sum from a cell in R?

I have a dataframe looking like this:
date P >60?
03-31-2020 6.8 0
03-30-2020 5.0 0
03-29-2020 0.0 0
03-28-2020 0.0 0
03-27-2020 2.0 0
03-26-2020 0.0 0
03-25-2020 71.0 1
03-24-2020 2.0 0
03-23-2020 0.0 0
03-22-2020 23.8 0
03-21-2020 0.0 0
03-20-2020 23.8 0
Code to reproduce the dataframe:
df1 <- data.frame(date = c("03-31-2020", "03-30-2020", "03-29-2020", "03-28-2020", "03-27-2020", "03-26-2020",
"03-25-2020", "03-24-2020", "03-23-2020", "03-22-2020", "03-21-2020", "03-20-2020"),
P = c(6.8, 5.0, 0.0, 0.0, 2.0, 0.0, 71.0, 2.0, 0.0, 23.8, 0.0, 23.8),
Sup60 = c(0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0))
I want to sum the P values for the N days before each day where P > 60.
For example, the first value above the threshold is P = 71 on 03-25-2020; from there I want to sum the 5 P values before that day:
2.0 + 0.0 + 23.8 + 0.0 + 23.8 = 49.6
It is a kind of moving sum because the concept is similar to a moving average.
Instead of the average of the last 5 values, for example, I want the sum of the last 5 values from a value greater than 60.
How can I do this?
Hi. First we work out how to calculate a running sum, then we apply if_else to that column. As a general rule, split a complex problem into smaller, solvable ones.
library(tidyverse)
df_example <- tibble::tribble(
~date, ~P, ~`>60?`,
"03-31-2020", 6.8, 0L,
"03-30-2020", 5, 0L,
"03-29-2020", 0, 0L,
"03-28-2020", 0, 0L,
"03-27-2020", 2, 0L,
"03-26-2020", 0, 0L,
"03-25-2020", 71, 1L,
"03-24-2020", 2, 0L,
"03-23-2020", 0, 0L,
"03-22-2020", 23.8, 0L,
"03-21-2020", 0, 0L,
"03-20-2020", 23.8, 0L
)
# lets start by doing a simple running sum
jjj <- df_example |>
arrange(date)
jjj |>
mutate(running_sum = slider::slide_dbl(.x = P,.f = ~ sum(.x),.before = 5,.after = -1)) |>
mutate(chosen_sum = if_else(P > 60,running_sum,NA_real_))
#> # A tibble: 12 x 5
#> date P `>60?` running_sum chosen_sum
#> <chr> <dbl> <int> <dbl> <dbl>
#> 1 03-20-2020 23.8 0 0 NA
#> 2 03-21-2020 0 0 23.8 NA
#> 3 03-22-2020 23.8 0 23.8 NA
#> 4 03-23-2020 0 0 47.6 NA
#> 5 03-24-2020 2 0 47.6 NA
#> 6 03-25-2020 71 1 49.6 49.6
#> 7 03-26-2020 0 0 96.8 NA
#> 8 03-27-2020 2 0 96.8 NA
#> 9 03-28-2020 0 0 75 NA
#> 10 03-29-2020 0 0 75 NA
#> 11 03-30-2020 5 0 73 NA
#> 12 03-31-2020 6.8 0 7 NA
Created on 2021-10-20 by the reprex package (v2.0.1)
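Note that arrange(date) above sorts character strings, which happens to work here because all dates fall in the same month and year. A minimal base R sketch of the same idea, assuming the dates are in month-day-year format as in the question: it parses the dates first, then sums the n_before values preceding each row where P > 60. The names n_before, hits and sums are illustrative.

```r
df1 <- data.frame(
  date = c("03-31-2020", "03-30-2020", "03-29-2020", "03-28-2020",
           "03-27-2020", "03-26-2020", "03-25-2020", "03-24-2020",
           "03-23-2020", "03-22-2020", "03-21-2020", "03-20-2020"),
  P = c(6.8, 5.0, 0.0, 0.0, 2.0, 0.0, 71.0, 2.0, 0.0, 23.8, 0.0, 23.8)
)

# Parse the dates so sorting is chronological, not alphabetical
df1$date <- as.Date(df1$date, format = "%m-%d-%Y")
df1 <- df1[order(df1$date), ]

n_before <- 5                 # how many earlier days to sum
hits <- which(df1$P > 60)     # rows that exceed the threshold

# For each hit, sum up to n_before values immediately before it
sums <- vapply(hits, function(i) {
  idx <- seq(max(1, i - n_before), length.out = min(n_before, i - 1))
  sum(df1$P[idx])
}, numeric(1))

data.frame(date = df1$date[hits], sum_before = sums)
```

When fewer than n_before earlier rows exist, this sums whatever is available; return NA instead if a partial window should not count.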

R Merge Data Frame Column and Recode

I have 2 R data frames that look like this:
DATA FRAME 1:
identifier ef_posterior position_no classification
11111 0.260 1 yes
11111 0.0822 2 yes
11111 0.00797 3 yes
11111 0.04 4 no
11111 0.245 5 yes
11111 0.432 6 yes
11112 0.342 1 maybe
11112 0.453 2 yes
11112 0.0032 3 yes
11112 0.241 5 no
11112 0.0422 6 yes
11112 0.311 4 no
DATAFRAME 2:
study_identifier %LVEF
11111 62
11112 76
I want to merge and rearrange these two data frames into something like the table below. study_identifier and identifier are the same thing (just different column names). Additionally, I would like to recode the classification so that yes = 0, no = 1, maybe = 2.
identifier pos_1 pos_1_class pos_2 pos_2_class pos_3 pos_3_class pos_4 pos_4_class pos_5 pos_5_class pos_6 pos_6_class %LVEF
11111 0.260 0 0.0822 0 0.00797 0 0.04 1 0.245 0 0.432 0 62
11112 0.342 2 0.453 0 0.0032 0 0.311 1 0.241 1 0.0422 0 76
df1 %>% mutate(position_no = paste0("position_", position_no)) %>%
pivot_wider(id_cols = identifier, names_from = position_no, values_from = ef_posterior) %>%
left_join(df2 %>% mutate(study_identifier = as.numeric(as.character(study_identifier))), by = c("identifier" = "study_identifier"))
This is the code I have right now, but I can't figure out where to put in the code for the classification column
How would I go about doing this?
Any help would be very much appreciated!
You can recode quite easily with dplyr and case_when:
df1 %>% mutate(
classification =
case_when( classification == "yes" ~ 0,
classification == "no" ~ 1,
classification == "maybe" ~ 2)
)
I would solve it the following way:
library(tidyverse)
df1 <- data.frame(
stringsAsFactors = FALSE,
identifier = c(11111L,11111L,11111L,11111L,
11111L,11111L,11112L,11112L,11112L,11112L,11112L,
11112L),
ef_posterior = c(0.26,0.0822,0.00797,0.04,
0.245,0.432,0.342,0.453,0.0032,0.241,0.0422,0.311),
position_no = c(1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 5L, 6L, 4L),
classification = c("yes","yes","yes","no",
"yes","yes","maybe","yes","yes","no","yes","no")
)
df2 <- data.frame(
check.names = FALSE,
study_identifier = c(11111L, 11112L),
`%LVEF` = c(62L, 76L)
)
df1 %>% mutate(
classification =
case_when( classification == "yes" ~ 0,
classification == "no" ~ 1,
classification == "maybe" ~ 2)
) %>%
pivot_wider(
id_cols = c(identifier), names_from = c(position_no), values_from = c(classification,ef_posterior)) %>%
left_join(df2, by = c("identifier" = "study_identifier"))
#> # A tibble: 2 x 14
#> identifier classification_1 classification_2 classification_3 classification_4
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 11111 0 0 0 1
#> 2 11112 2 0 0 1
#> # … with 9 more variables: classification_5 <dbl>, classification_6 <dbl>,
#> # ef_posterior_1 <dbl>, ef_posterior_2 <dbl>, ef_posterior_3 <dbl>,
#> # ef_posterior_4 <dbl>, ef_posterior_5 <dbl>, ef_posterior_6 <dbl>,
#> # `%LVEF` <int>
Created on 2021-04-12 by the reprex package (v0.3.0)
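To get the column names and ordering asked for in the question (pos_1, pos_1_class, ...), the wide columns can be renamed after pivoting. This is a sketch rather than the answerer's code. It assumes a tidyr version that supports the names_vary argument of pivot_wider (1.2.0 or later, as far as I know) and uses the yes = 0, no = 1, maybe = 2 coding from the question.

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(
  identifier = rep(c(11111L, 11112L), each = 6),
  ef_posterior = c(0.26, 0.0822, 0.00797, 0.04, 0.245, 0.432,
                   0.342, 0.453, 0.0032, 0.241, 0.0422, 0.311),
  position_no = c(1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 5L, 6L, 4L),
  classification = c("yes", "yes", "yes", "no", "yes", "yes",
                     "maybe", "yes", "yes", "no", "yes", "no")
)
df2 <- data.frame(check.names = FALSE,
                  study_identifier = c(11111L, 11112L),
                  `%LVEF` = c(62L, 76L))

wide <- df1 %>%
  # recode as requested in the question: yes = 0, no = 1, maybe = 2
  mutate(classification = case_when(classification == "yes" ~ 0,
                                    classification == "no" ~ 1,
                                    classification == "maybe" ~ 2)) %>%
  # names_vary = "slowest" keeps each position's two columns side by side
  pivot_wider(id_cols = identifier, names_from = position_no,
              values_from = c(ef_posterior, classification),
              names_vary = "slowest") %>%
  # ef_posterior_1 -> pos_1, classification_1 -> pos_1_class
  rename_with(~ sub("^ef_posterior_", "pos_", .x),
              starts_with("ef_posterior_")) %>%
  rename_with(~ sub("^classification_([0-9]+)$", "pos_\\1_class", .x),
              starts_with("classification_")) %>%
  left_join(df2, by = c("identifier" = "study_identifier"))

wide
```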

Normalize data by subtracting the first row from every value in multiple columns

I tried to solve my problem by applying several solutions proposed on this forum, but they did not work.
Basically, I have a data frame:
Concentration Salinity Light.Dark Distance Velocity In_Center Freezing Cruising Bursting Clockwise CounterClockwise
<ord> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 V 0.5 Dark 0.0612 0.0826 0.0638 0.0207 0.0124 0.00511 -0.0866 -0.0439
2 L 0.5 Dark 0.0360 0.0282 -0.166 -0.00475 0.148 -0.0328 -0.0337 0.0615
3 M 0.5 Dark -0.144 -0.147 0.00761 0.0405 -0.191 -0.00586 0.0772 -0.0123
4 H 0.5 Dark 0.0464 0.0362 0.0949 -0.0565 0.0306 0.0335 0.0431 -0.00527
I want to normalize the columns from Distance to CounterClockwise by subtracting the first row from every row.
I tried:
df_norm= df %>%
mutate_at(4:11, list(~ .- first(.)))
But it returned only 0:
Concentration Salinity Light.Dark Distance Velocity In_Center Freezing Cruising Bursting Clockwise CounterClockwise
<ord> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 V 0.5 Dark 0 0 0 0 0 0 0 0
2 L 0.5 Dark 0 0 0 0 0 0 0 0
3 M 0.5 Dark 0 0 0 0 0 0 0 0
4 H 0.5 Dark 0 0 0 0 0 0 0 0
I tried to convert the tibble to a dataframe using:
as.data.frame(df_norm)
But I got:
Concentration Salinity Light.Dark Distance Velocity In_Center Freezing Cruising Bursting Clockwise CounterClockwise
1 V 0.5 Dark 0 0 0 0 0 0 0 0
2 L 0.5 Dark 0 0 0 0 0 0 0 0
3 M 0.5 Dark 0 0 0 0 0 0 0 0
4 H 0.5 Dark 0 0 0 0 0 0 0 0
Here is a dput of my df:
structure(list(Concentration = structure(1:4, .Label = c("V",
"L", "M", "H"), class = c("ordered", "factor")), Salinity = structure(c(1L,
1L, 1L, 1L), .Label = c("0.5", "2", "6"), class = "factor"),
Light.Dark = structure(c(1L, 1L, 1L, 1L), .Label = c("Dark",
"ERROR", "Light"), class = "factor"), Distance = c(0.0611762417792624,
0.0359847599237893, -0.143596409795565, 0.0464354080925131
), Velocity = c(0.0825514600369596, 0.0282499048624341, -0.146998610001507,
0.0361972451021132), In_Center = c(0.06383139350079, -0.166302972291672,
0.00760502103176895, 0.0948665577591132), Freezing = c(0.0206958889309448,
-0.00474520212061713, 0.0405259034871347, -0.0564765902974621
), Cruising = c(0.0123684826368456, 0.148343102625951, -0.191335919657439,
0.0306243343946422), Bursting = c(0.00511229994076513, -0.0327935337663713,
-0.00586044139551122, 0.0335416752211175), Clockwise = c(-0.0865980448950217,
-0.0337007169788508, 0.077213035103443, 0.0430857267704295
), CounterClockwise = c(-0.0439324933628217, 0.0615054907079504,
-0.0123010981415901, -0.00527189920353861)), row.names = c(NA,
-4L), groups = structure(list(Concentration = structure(1:4, .Label = c("V",
"L", "M", "H"), class = c("ordered", "factor")), Salinity = structure(c(1L,
1L, 1L, 1L), .Label = c("0.5", "2", "6"), class = "factor"),
.rows = structure(list(1L, 2L, 3L, 4L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, 4L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
Any idea of what I am doing wrong?
Thank you very much for your help!
Based on the dput, it is a grouped dataset that has only one row per group:
df %>%
summarise(n = n())
# A tibble: 4 x 3
# Groups: Concentration [4]
# Concentration Salinity n
# <ord> <fct> <int>
#1 V 0.5 1
#2 L 0.5 1
#3 M 0.5 1
#4 H 0.5 1
So within each group first(.) is the row's own value, and every value is subtracted from itself, which gives 0.
To do this on the full dataset, ungroup first and then apply the code:
df %>%
ungroup() %>%
mutate(across(4:11, ~ . - first(.)))
# alternatively, apply it to all numeric columns:
# mutate(across(where(is.numeric), ~ . - first(.)))
# A tibble: 4 x 11
# Concentration Salinity Light.Dark Distance Velocity In_Center Freezing Cruising Bursting Clockwise CounterClockwise
# <ord> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 V 0.5 Dark 0 0 0 0 0 0 0 0
#2 L 0.5 Dark -0.0252 -0.0543 -0.230 -0.0254 0.136 -0.0379 0.0529 0.105
#3 M 0.5 Dark -0.205 -0.230 -0.0562 0.0198 -0.204 -0.0110 0.164 0.0316
#4 H 0.5 Dark -0.0147 -0.0464 0.0310 -0.0772 0.0183 0.0284 0.130 0.0387
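The effect of the grouping can be seen on a tiny made-up tibble (toy, g and x are illustrative names, not from the question). With one row per group, first(x) is always the row's own value, so the grouped result is all zeros; after ungroup() the first row of the whole table is used.

```r
library(dplyr)

# One row per group, mirroring the structure of the question's data
toy <- tibble(g = c("a", "b", "c"), x = c(1, 3, 6)) %>%
  group_by(g)

# Grouped: each value minus itself
grouped <- toy %>% mutate(d = x - first(x))

# Ungrouped: each value minus the first row of the table
ungrouped <- toy %>% ungroup() %>% mutate(d = x - first(x))

grouped$d
ungrouped$d
```

Calling is_grouped_df(df) or group_vars(df) is a quick way to check whether a tibble still carries groups from an earlier step.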
