I would like to conditionally mutate a new column representing the pmax() of columns ending with "_n" for a given row. I know I can do this by explicitly specifying the column names, but I would prefer to have this be the result of a call to ends_with() or similar.
I have tried mutate_at() and plain mutate(). My general thought is that I need to pass a vars(ends_with("_n")) to something, but I'm just missing that something.
Thanks in advance.
library(dplyr)
library(tidyr)
mtcars %>%
group_by(vs, gear) %>%
summarize(mean = mean(disp),
sd = sd(disp),
n = n()) %>%
mutate_if(is.double, round, 1) %>%
mutate(mean_sd = paste0(mean, " (", sd, ")")) %>%
select(-mean, -sd) %>%
group_by(vs, gear) %>%
nest(n, mean_sd, .key = "summary") %>%
spread(key = vs, value = summary) %>%
unnest(`0`, `1`, .sep = "_")
gear `0_n` `0_mean_sd` `1_n` `1_mean_sd`
<dbl> <int> <chr> <int> <chr>
1 3 12 357.6 (71.8) 3 201 (72)
2 4 2 160 (0) 10 115.6 (38.5)
3 5 4 229.3 (113.9) 1 95.1 (NA)
edit: both answers are much appreciated. Cheers!
Here's one way using the unquote-splice operator. We can select columns that we want to compare and then splice them as vectors into pmax:
library(tidyverse)
tbl <- structure(list(gear = c(3, 4, 5), `0_n` = c(12L, 2L, 4L), `0_mean_sd` = c("357.6 (71.8)", "160 (0)", "229.3 (113.9)"), `1_n` = c(3L, 10L, 1L), `1_mean_sd` = c("201 (72)", "115.6 (38.5)", "95.1 (NA)")), row.names = c(NA, -3L), class = c("tbl_df", "tbl", "data.frame"))
tbl %>%
mutate(pmax = pmax(!!!select(., ends_with("_n"))))
#> # A tibble: 3 x 6
#> gear `0_n` `0_mean_sd` `1_n` `1_mean_sd` pmax
#> <dbl> <int> <chr> <int> <chr> <int>
#> 1 3 12 357.6 (71.8) 3 201 (72) 12
#> 2 4 2 160 (0) 10 115.6 (38.5) 10
#> 3 5 4 229.3 (113.9) 1 95.1 (NA) 4
Created on 2019-04-23 by the reprex package (v0.2.1)
A base R version, just as an alternative:
tbl <- structure(list(gear = c(3, 4, 5), `0_n` = c(12L, 2L, 4L), `0_mean_sd` = c("357.6 (71.8)", "160 (0)", "229.3 (113.9)"), `1_n` = c(3L, 10L, 1L), `1_mean_sd` = c("201 (72)", "115.6 (38.5)", "95.1 (NA)")), row.names = c(NA, -3L), class = c("tbl_df", "tbl", "data.frame"))
tbl$pmax <- do.call(pmax,as.list(dat[,grepl("_n$",names(dat))]))
Related
Given a table of counts specified in 'dat' I would like to create a dataframe with 3 columns (race, grp and outcome) and 206 rows. The variable outcome would be 1 if for ascertained, and 0 if 'missed'.
dat <- structure(list(race = structure(c(1L, 2L, 1L, 2L), levels = c("black",
"nonblack"), class = "factor"), grp = structure(c(1L, 1L, 2L,
2L), levels = c("hbpm", "uc"), class = "factor"), ascertained = c(63,
32, 24, 21), missed = c(5, 3, 49, 9), total = c(68, 35, 73, 30
)), class = "data.frame", row.names = c(NA, -4L))
1) For each row set race in the output to that race, grp in the output to that group and then generate the appropriate number of 1s and 0s for outcome. The result is 206 x 3.
library(dplyr)
dat %>%
rowwise %>%
summarize(race = race, grp = grp, outcome = rep(1:0, c(ascertained, missed)))
2) In the example data there are no duplicate race/grp and if that is true in general then it can alternately be written as::
dat %>%
group_by(race, grp) %>%
summarize(outcome = rep(1:0, c(ascertained, missed)), .groups = "drop")
3) A base R solution would be the following. If each combination of race/grp occurs on only one row of the input then 1:nrow(dat) could optionally be replaced with dat[1:2].
do.call("rbind",
by(dat,
1:nrow(dat),
with,
data.frame(race = race, grp = grp, outcome = rep(1:0, c(ascertained, missed)))
)
)
How about this:
library(tidyverse)
dat <- structure(list(race = structure(c(1L, 2L, 1L, 2L), levels = c("black",
"nonblack"), class = "factor"), grp = structure(c(1L, 1L, 2L,
2L), levels = c("hbpm", "uc"), class = "factor"), ascertained = c(63,
32, 24, 21), missed = c(5, 3, 49, 9), total = c(68, 35, 73, 30
)), class = "data.frame", row.names = c(NA, -4L))
dat2 <- dat %>% select(-total) %>%
pivot_longer(c(ascertained, missed), names_to = "var", values_to="vals") %>%
uncount(vals) %>%
mutate(outcome = case_when(var == "ascertained" ~ 1,
TRUE ~ 0)) %>%
select(-var)
head(dat2)
#> # A tibble: 6 × 3
#> race grp outcome
#> <fct> <fct> <dbl>
#> 1 black hbpm 1
#> 2 black hbpm 1
#> 3 black hbpm 1
#> 4 black hbpm 1
#> 5 black hbpm 1
#> 6 black hbpm 1
dat2 %>%
group_by(race, grp, outcome) %>%
tally()
#> # A tibble: 8 × 4
#> # Groups: race, grp [4]
#> race grp outcome n
#> <fct> <fct> <dbl> <int>
#> 1 black hbpm 0 5
#> 2 black hbpm 1 63
#> 3 black uc 0 49
#> 4 black uc 1 24
#> 5 nonblack hbpm 0 3
#> 6 nonblack hbpm 1 32
#> 7 nonblack uc 0 9
#> 8 nonblack uc 1 21
This is based partially on the linked question from Limey in the comments:
library(tidyverse)
bind_rows(
dat %>% uncount(ascertained) %>% mutate(outcome = 1) %>% select(-missed, -total),
dat %>% uncount(missed) %>% mutate(outcome = 0) %>% select(-ascertained, -total)
)
Here is a relatively simple answer that is based on, in part, the answer suggested in a comment, but adapted to work for your problem, since you need multiple "uncounts". This answer uses function from the packages tibble, dplyr, and tidyr. These are all in the tidyverse.
The exact method is to create two sub-lists, one listing out the "ascertained", and one listing out the "missed", formatting the ascertained column as you wanted, and then mashing these two together with a basic tibble::add_row.
The relevant code is:
library(tidyverse)
dat2 <- uncount(dat, ascertained, .remove = F) %>%
mutate(ascertained = 1) %>%
select(-missed)
dat3 <- uncount(dat, missed, .remove = T) %>%
mutate(ascertained = 0)
dat4 <- add_row(dat2, dat3) %>% select(-total) %>%
rename(outcome = ascertained)
dat4 should be the data as you asked for it. I would suggest also generating an id column to make things easier to work with, but obviously that is up to you.
I have an excel file that looks like this:
ID
strength_score_week_1
agility_score_week_1
strength_score_week_2
agility_score_week_2
1
3
6
4
6
2
5
6
6
6
3
8
8
9
8
4
6
7
6
4
I want to rearrange/ rewrite the data above into a data frame that arranges it to this format:
Week
training type
mean score
1
agility
1
strength
2
agility
2
strength
essentially what I want to do with the final table is - I want to group it by training type and plot 2 line graphs showing the mean score for agility and strength over a period of 40 weeks
any help would be very much appreciated!
df <- data.frame(
ID = c(1L, 2L, 3L, 4L),
strength_score_week_1 = c(3L, 5L, 8L, 6L),
agility_score_week_1 = c(6L, 6L, 8L, 7L),
strength_score_week_2 = c(4L, 6L, 9L, 6L),
agility_score_week_2 = c(6L, 6L, 8L, 4L)
)
df
#> ID strength_score_week_1 agility_score_week_1 strength_score_week_2
#> 1 1 3 6 4
#> 2 2 5 6 6
#> 3 3 8 8 9
#> 4 4 6 7 6
#> agility_score_week_2
#> 1 6
#> 2 6
#> 3 8
#> 4 4
library(tidyverse)
df %>%
pivot_longer(!ID, names_pattern = '([^_]*)_score_week_(.*)', names_to = c('training_type', 'week')) %>%
group_by(week, training_type) %>%
summarise(mean_score = mean(value), .groups = 'drop') %>%
mutate(week = as.numeric(week)) %>%
ggplot(aes(x = week, y = mean_score, color = training_type, group = training_type)) +
geom_line()
Created on 2021-07-22 by the reprex package (v2.0.0)
Try this
library(readxl) #library to import excel sheets
df <- t(read_excel('Book1.xlsx')[,-1]) #import data (remove id column)
df_mean <- rowMeans(df) #calculate mean score
#get auxiliar matrix with names of elements
aux <- matrix(unlist(strsplit(rownames(df), '_')), nrow = nrow(df), byrow = T)[,c(1,4)]
colnames(aux) <- c('feature', 'week')
#Join everything in a data frame
df <- as.data.frame(cbind(df_mean, aux))
#plot
library(ggplot2)
ggplot(df)+
geom_point(aes(x = week, y = df_mean, colour = factor(feature)))
library(dplyr)
library(tibble)
library(stringr)
dt <- as.data.frame(t(dt))[-1,]
dt %>%
rownames_to_column() %>%
rowwise() %>%
mutate(`training type` = str_split(rowname, "_")[[1]][1],
week = str_split(rowname, "_")[[1]][4]) %>%
ungroup() %>%
mutate(`mean score` = rowMeans(.[,2:5])) %>%
select(week, `training type`, `mean score`)
Which results to:
# A tibble: 4 x 3
week `training type` `mean score`
<chr> <chr> <dbl>
1 1 strength 5.5
2 1 agility 6.75
3 2 strength 6.25
4 2 agility 6
If you have training types that contain multiple words you would to use a different function instead of str_split. If that is the case I can re-write that part of the code
A base R option
do.call(
rbind,
apply(
aggregate(
cbind(strength, agility) ~ time,
reshape(
setNames(df, gsub("_score_", ".", names(df))),
direction = "long",
idvar = "ID",
varying = -1
), mean
), 1, function(x) cbind(week = x[[1]], rev(stack(x[-1])))
)
)
gives
week ind values
1 week_1 strength 5.50
2 week_1 agility 6.75
3 week_2 strength 6.25
4 week_2 agility 6.00
I would use a mix of pivot_longer, seperate and mutate in this fashion,
data %>%
pivot_longer(cols = -"ID", names_to = "training_type") %>%
mutate(training_type = str_remove(training_type, "_score")) %>%
group_by(training_type) %>%
summarise(mean_score = mean(value, na.rm = TRUE)) %>%
separate(
col = "training_type",
sep = "_week_",
into = c("training_type", "week")
) %>%
mutate(week = as.numeric(week))
Which gives you the following output,
# A tibble: 4 x 3
training_type week mean_score
<chr> <dbl> <dbl>
1 agility 1 6.75
2 agility 2 6
3 strength 1 5.5
4 strength 2 6.25
Which are ready to be plotted by,
data %>% ggplot(
mapping = aes(
x = week,
y = mean_score,
color = training_type
)
) + geom_line()
Given the following data
df1 <- structure(list(ID = 1:3, alpha_1 = c(2L, 2L, 3L),
alpha_2 = c(1L, 2L,
3L), alpha_3 = c(4L, 4L, 2L), alpha_4 = c(3L, NA, NA), beta_1 = c(NA,
2L, NA), beta_2 = c(3L, NA, 2L), charlie_1 = c(1L, NA, 1L), charlie_2 = c(NA,
2L, NA)), class = "data.frame", row.names = c(NA, -3L))
I'm trying to coalesce all columns sharing the same initial prefix name (i.e. coalesce alpha_1, alpha_2, alpha_3, alpha_4, and coalesce beta_1 beta_2, etc.), but from both the left and right sides. That is, I want to generate two new variables, say 'alpha_left' and 'alpha_right', whose columns would be, in this example, (2, 2, 3) and (3, 4, 2) respectively (first non-missing elements from the left and right side of the dataframe).
User #akrun offered a great solution for the coalescing part here, but I'm unsure how to create two new variables from both the left and right coalesces.
Here is an option in tidyverse
Reshape to 'long' format - pivot_longer
Grouped by 'ID'
Do the summarise across the columns 'alpha' till 'charlie'
Get the column name - cur_column()
Create a tibble with the first non-NA element from the left and the right
Change the column names by appending the 'nm1' as prefix
Finally, unnest the list columns created in summarise
library(dplyr)
library(tidyr)
library(stringr)
df1 %>%
pivot_longer(cols = contains("_"),
names_to = c( ".value", "grp"), names_sep = "_") %>%
group_by(ID) %>%
summarise(across(alpha:charlie, ~ {
nm1 <- cur_column()
tbl1 <- tibble(left= .[complete.cases(.)][1],
right = rev(.)[complete.cases(rev(.))][1]);
names(tbl1) <- str_c(nm1, "_", names(tbl1))
list(tbl1)})) %>%
unnest(c(alpha, beta, charlie))
-output
# A tibble: 3 x 7
ID alpha_left alpha_right beta_left beta_right charlie_left charlie_right
<int> <int> <int> <int> <int> <int> <int>
1 1 2 3 3 3 1 1
2 2 2 4 2 2 2 2
3 3 3 2 2 2 1 1
Or using base R
lst1 <- lapply(split.default(df1[-1], sub("_\\d+$", "", names(df1)[-1])),
function(x) {
x1 <- apply(x, 1, function(y) {
y1 <- na.omit(y)
if(length(y1) > 1 ) y1[c(1, length(y1))] else y1[1]
})
if(is.vector(x1)) as.data.frame(matrix(x1)) else as.data.frame(t(x1))
})
You could also do:
df1[-1] %>%
split.default(sub("_\\d+", "", names(.))) %>%
imap_dfc(~data.frame(right = coalesce(!!!.x),
left = coalesce(!!!rev(.x))) %>%
set_names(paste(.y, names(.), sep="_")))
alpha_right alpha_left beta_right beta_left charlie_right charlie_left
1 2 3 3 3 1 1
2 2 4 2 2 2 2
3 3 2 2 2 1 1
One more approach not as elegant as #Onyambu's
library(tidyverse)
df1[-1] %>%
split.default(sub("_\\d+", "", names(.))) %>%
imap_dfc(~ .x %>% rowwise() %>%
mutate(!!paste0(.y, '_left') := head(na.omit(c_across(everything())),1),
!!paste0(.y, '_right') := tail(na.omit(c_across(!last_col())),1),
.keep = 'none' )
)
#> # A tibble: 3 x 6
#> # Rowwise:
#> alpha_left alpha_right beta_left beta_right charlie_left charlie_right
#> <int> <int> <int> <int> <int> <int>
#> 1 2 3 3 3 1 1
#> 2 2 4 2 2 2 2
#> 3 3 2 2 2 1 1
Created on 2021-06-19 by the reprex package (v2.0.0)
Another option
library(tidyverse)
df1 <- structure(list(ID = 1:3, alpha_1 = c(2L, 2L, 3L),
alpha_2 = c(1L, 2L,
3L), alpha_3 = c(4L, 4L, 2L), alpha_4 = c(3L, NA, NA), beta_1 = c(NA,
2L, NA), beta_2 = c(3L, NA, 2L), charlie_1 = c(1L, NA, 1L), charlie_2 = c(NA,
2L, NA)), class = "data.frame", row.names = c(NA, -3L))
df1 %>%
pivot_longer(cols = -ID, names_sep = "_", names_to = c(".value", "set")) %>%
group_by(ID) %>%
fill(alpha:charlie, .direction = "updown") %>%
filter(set %in% range(set)) %>%
mutate(set = c("left", "right")) %>%
pivot_wider(id_cols = ID, names_from = set, values_from = alpha:charlie)
#> # A tibble: 3 x 7
#> # Groups: ID [3]
#> ID alpha_left alpha_right beta_left beta_right charlie_left charlie_right
#> <int> <int> <int> <int> <int> <int> <int>
#> 1 1 2 3 3 3 1 1
#> 2 2 2 4 2 2 2 2
#> 3 3 3 2 2 2 1 1
Created on 2021-06-20 by the reprex package (v2.0.0)
I have two dataframes in R and want to compare any entries of rows. I want two check if the value of the first entrie, second entrie etc. of first (any) row of the first dataframe is bigger as the entrie of the first entrie of the the first row of the second dataframe. Afterwards it should give me a TRUE if all entries are bigger and in the intervall (0,2). It looks like this.
Dataframe 1
Letter 2011 2012 2013
A 2 3 5
B 6 6 6
C 5 4 8
Dataframe 2
Letter 2011 2012 2013
A 1 1 4
C 5 5 5
Result for example like this (comparing rows A and A and C and C)
Letter 2011 2012 2013
A 1 2 1 TRUE- all ok
C 0 -1 3 FALSE- second entrie smaller of the first table and third entrie much more
bigger of the first table.
One approach could be to convert data to long format, perform an inner_join subtract values, check if all the values are in range and get the data back in wide format.
library(dplyr)
library(tidyr)
df1 %>% pivot_longer(cols = -Letter) %>%
inner_join(df2 %>% pivot_longer(cols = -Letter), by = c("Letter", "name")) %>%
mutate(value = value.x - value.y) %>%
group_by(Letter) %>%
mutate(check = all(between(value, 0, 2))) %>%
select(-value.x, -value.y) %>%
pivot_wider()
# Letter check `2011` `2012` `2013`
# <chr> <lgl> <int> <int> <int>
#1 A TRUE 1 2 1
#2 C FALSE 0 -1 3
data
df1 <- structure(list(Letter = c("A", "B", "C"), `2011` = c(2L, 6L,5L),
`2012` = c(3L, 6L, 4L), `2013` = c(5L, 6L, 8L)), row.names = c(NA, -3L),
class = "data.frame")
df2 <- structure(list(Letter = c("A", "C"), `2011` = c(1L, 5L), `2012` = c(1L,
5L), `2013` = 4:5), row.names = c(NA, -2L), class = "data.frame")
This works as expected except for the fact that the column meaningful is also selected. I just want the following columns to be selected:
mean...summary, mean.conf.low...summary, mean.conf.high...summary
How can I do that?
library(tidyverse)
# dataframe
df <- structure(
list(
group = structure(1:2, .Label = c("a", "b"), class = "factor"),
meaningful = c(0.98, 1.39333333333333),
mean...summary = c(0.98,
1.39333333333333),
n...summary = c(3L, 3L),
mean.conf.low...summary = c(0.717103575690863,
0.921129311562406),
mean.conf.high...summary = c(1.24289642430914,
1.86553735510426)
),
class = c("tbl_df", "tbl", "data.frame"),
row.names = c(NA, -2L)
)
# changing few columns
df %>%
dplyr::mutate_at(
.tbl = .,
.vars = dplyr::vars(dplyr::matches("^mean...|^mean.conf")),
.funs = ~ format(round(x = ., digits = 3), nsmall = 3)
)
#> # A tibble: 2 x 6
#> group meaningful mean...summary n...summary mean.conf.low...~ mean.conf.high.~
#> <fct> <chr> <chr> <int> <chr> <chr>
#> 1 a 0.980 0.980 3 0.717 1.243
#> 2 b 1.393 1.393 3 0.921 1.866
Created on 2019-11-22 by the reprex package (v0.3.0)
. has a special meaning in regular expressions, so you could skip it and treated as a literal . by using \\, e.g.
matches("^mean\\.\\.\\.|^mean\\.conf") or matches("^mean\\.{3}|^mean\\.conf")