I am trying to group my dataframe and set the last N values of a column in each group to NA. I can do it for N = 1 like so:
df %>% group_by(ID) %>% mutate(target = c(target[-n()], NA))
But I am struggling to get it to work for an arbitrary N.
This is my current attempt:
df %>% group_by(ID) %>% mutate(target = c(target[1:(abs(n()-1))], NA))
But this seems to fail for groups of size 1
I also tried:
df %>% group_by(ID) %>% mutate(target = ifelse(n()==1, target, c(target[1:(abs(n()-1))], NA)))
But the else clause never takes effect.
Any advice would be appreciated, thanks.
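For reference, both failures can be reproduced in base R without dplyr. The sketch below is only an illustration: n stands in for n() and x for target, and the values are made up.

```r
# Stand-ins for a single group: n plays the role of n(), x of target
n <- 1
x <- 5

# 1:0 counts down instead of being empty, so the index is c(1, 0)
idx <- 1:(abs(n - 1))

# x[c(1, 0)] keeps element 1 (index 0 is dropped); appending NA then gives
# length 2 for a group of size 1, which mutate() rejects
attempt1 <- c(x[idx], NA)

# ifelse() returns a result shaped like its test; the test has length 1,
# so only the first element of the chosen branch comes back
attempt2 <- ifelse(3 == 1, c(1, 2, 3), c(1, 2, NA))
```

Here attempt1 has length 2 and attempt2 is a single value, which mutate() then recycles across the whole group. An index built with seq_len(n - 1) would be empty for n = 1 and avoids the first problem.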
We could use
library(dplyr)
df %>%
group_by(ID) %>%
mutate(target = replace(target, tail(row_number(), N), NA))
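As a quick check of the edge cases, here is a small sketch on made-up data (the tibble toy and N = 2 are assumptions for illustration, not from the question). It includes a group with fewer than N rows and a group of size 1.

```r
library(dplyr)

# Hypothetical toy data: group "b" has exactly N rows, group "c" has one row
toy <- tibble(ID = c("a", "a", "a", "b", "b", "c"), target = 1:6)
N <- 2

res <- toy %>%
  group_by(ID) %>%
  # tail() returns at most N positions, so short groups become all NA
  mutate(target = replace(target, tail(row_number(), N), NA)) %>%
  ungroup()

res$target
```

Only the first row of group "a" keeps its value; groups with N or fewer rows are blanked entirely rather than raising an error.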
You can use case_when() for a vectorized solution in dplyr.
library(dplyr)
df %>%
group_by(ID) %>%
mutate(target = case_when(row_number() <= n() - N ~ target))
My thanks to @akrun for pointing out that case_when() defaults to an NA of the proper type, so case_when() automatically fills the last N with NAs.
Update
The solution by @akrun is more performant: when benchmarked at times = 50 repetitions apiece
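The default-NA behaviour can be seen outside of a data frame as well. A minimal sketch, with a made-up vector x and N = 1:

```r
library(dplyr)

x <- c(10, 20, 30)  # made-up values
N <- 1

# Positions that match no condition fall through to NA of x's type
out <- case_when(seq_along(x) <= length(x) - N ~ x)
out
```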
library(microbenchmark)
big_df <- tibble(ID = rep(letters, 100000)) %>%
mutate(target = row_number()) %>%
group_by(ID)
microbenchmark(
times = 50,
Greg = {
big_df %>%
mutate(target = case_when(row_number() <= n() - N ~ target))
},
akrun = {
big_df %>%
mutate(target = replace(target, tail(row_number(), N), NA))
}
)
it is about 35% faster than mine at scale (2600000 rows):
Unit: milliseconds
  expr     min      lq      mean   median       uq      max neval
  Greg 82.6337 90.9669 128.93278 96.35760 213.3593 258.8570    50
 akrun 52.4519 55.8314  63.40997 61.43945  64.1082 196.4069    50
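Alongside the timing, it may be worth confirming that the two approaches agree. The sketch below does this on a small made-up tibble (toy and N = 2 are assumptions), including groups with N or fewer rows.

```r
library(dplyr)

# Hypothetical groups of size 4, 2 and 1
toy <- tibble(ID = rep(c("a", "b", "c"), times = c(4, 2, 1)), target = 1:7) %>%
  group_by(ID)
N <- 2

via_case_when <- toy %>%
  mutate(target = case_when(row_number() <= n() - N ~ target))

via_replace <- toy %>%
  mutate(target = replace(target, tail(row_number(), N), NA))

# TRUE if both approaches produce the same column
same <- identical(via_case_when$target, via_replace$target)
same
```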
Here is an alternative suggestion:
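We could sort each group in descending order with arrange (using -x), set the first N rows to NA with ifelse, and sort again. Note that this blanks the N largest values of x in each group rather than the last N rows, and leaves each group sorted by x:
library(dplyr)
N <- 2
df %>%
group_by(id) %>%
arrange(-x, .by_group = TRUE) %>%
# row_number() <= N flags the first N rows of each group
mutate(x = ifelse(row_number() <= N, NA, x)) %>%
arrange(x, .by_group = TRUE)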
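To make the effect concrete, here is a sketch run on the same values as this answer's sample df, using row_number() <= N as the condition. The tibble is rebuilt inline so the block is self-contained.

```r
library(dplyr)

# Same values as the sample data given in this answer
df <- tibble(
  x = c(2, 4, 6, 1, 2, 5, 6, 7, 3, 4, 5, 6),
  id = rep(1:3, each = 4)
)
N <- 2

res <- df %>%
  group_by(id) %>%
  arrange(-x, .by_group = TRUE) %>%              # largest x first within each id
  mutate(x = ifelse(row_number() <= N, NA, x)) %>%
  arrange(x, .by_group = TRUE) %>%               # NAs sort last
  ungroup()

res$x
```

In group 1 the original order was 2, 4, 6, 1, and the result is 1, 2, NA, NA: the two largest values are blanked and the rows are reordered, which differs from blanking the last two rows in their original order.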
df <- structure(list(x = c(2, 4, 6, 1, 2, 5, 6, 7, 3, 4, 5, 6), id = c(1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -12L))
Related
I have two dataframes that I want to turn into a single table using gt
library(dplyr)
library(gt)
a <- rnorm(21, mean = 112, sd =12)
colour <- rep(c("Blue", "Red", "Green"), 7)
data <- data.frame(colour, a)
data <- data %>%
group_by(colour) %>%
summarise(mean = mean(a), sd = sd(a), n = n()) %>%
ungroup() %>%
gt()
a <- rnorm(21, mean = 60, sd =12)
day <- rep(c("2", "4", "6"), 7)
data2 <- data.frame(day, a)
data2 <- data2 %>%
group_by(day) %>%
summarise(mean = mean(a), sd = sd(a), n = n()) %>%
ungroup() %>%
gt()
How do I stack the two dataframes on top of each other and apply two sideways spanner labels, colour and day? Something similar to below, where 2014 and 2015 would be my mean and sd columns, China would be colour (with Blue, Red and Green underneath), and India would be day (with the days stacked underneath).
OR (for curiosity/ideally).
Not have colour and day where China and India are, but instead have a sideways spanner (i.e. vertical instead of horizontal). Horizontal isn't good for my real data, as there are too many categories and it would make a really wide table.
You can give each a variable to identify what the group will be ("grp") and rename the color/day variable ("cat") and use bind_rows to combine them before using gt:
a <- rnorm(21, mean = 112, sd =12)
colour <- rep(c("Blue", "Red", "Green"), 7)
data <- data.frame(colour, a) %>%
group_by(colour) %>%
summarise(mean = mean(a), sd = sd(a), n = n()) %>%
mutate(grp = "colour") %>%
rename(cat = colour)
b <- rnorm(21, mean = 60, sd =12)
day <- rep(c("2", "4", "6"), 7)
data2 <- data.frame(day, b) %>%
group_by(day) %>%
summarise(mean = mean(b), sd = sd(b), n = n()) %>%
mutate(grp = "day") %>%
rename(cat = day)
bind_rows(data, data2) %>%
group_by(grp) %>%
gt(rowname_col = "cat")
Also I don't think this is exactly what you are asking for on the second option, but there is a row_group.as_column option for tab_options:
bind_rows(data, data2) %>%
group_by(grp) %>%
gt() %>%
tab_options(row_group.as_column = TRUE)
I have the following data and table:
library(gt)
library(dplyr)
a <- rnorm(21, mean = 112, sd =12)
colour <- rep(c("Blue", "Red", "Green"), 7)
data <- data.frame(colour, a) %>%
group_by(colour) %>%
summarise(mean = mean(a), sd = sd(a), n = n()) %>%
mutate(grp = html("[H<sub>2</sub>O]")) %>%
rename(cat = colour)
b <- rnorm(21, mean = 60, sd =12)
day <- rep(c("2", "4", "6"), 7)
data2 <- data.frame(day, b) %>%
group_by(day) %>%
summarise(mean = mean(b), sd = sd(b), n = n()) %>%
mutate(grp = html("[H<sub>2</sub>O] Addition <br> (Days)")) %>%
rename(cat = day)
bind_rows(data, data2) %>%
group_by(grp) %>%
gt(rowname_col = "cat")
bind_rows(data, data2) %>%
group_by(grp) %>%
gt() %>%
tab_options(row_group.as_column = TRUE)
The row group labels appear literally as '[H<sub>2</sub>O]', rather than as [H2O] with a subscript 2, etc. It is likely that I am using html() wrong and that it needs to be used with another package/function. I have also tried using cols_label, but it doesn't recognise these as columns in the dataframe.
Is there also a way to have the row groups column vertically centered, rather than at the top where it currently is? And how do you bold these row groups?
The html function won't work outside of a gt table, so you'll have to create the row groups using tab_row_group and add the html labels there.
data <- data.frame(colour, a) %>%
group_by(colour) %>%
summarise(mean = mean(a), sd = sd(a), n = n()) %>%
mutate(grp = "color") %>%
rename(cat = colour)
data2 <- data.frame(day, b) %>%
group_by(day) %>%
summarise(mean = mean(b), sd = sd(b), n = n()) %>%
mutate(grp = "day") %>%
rename(cat = day)
bind_rows(data, data2) %>%
gt() %>%
tab_row_group(
label = html("[H<sub>2</sub>O]"),
rows = grp == "color"
) %>%
tab_row_group(
label = html("[H<sub>2</sub>O] Addition <br> (Days)"),
rows = grp == "day"
) %>%
cols_hide(grp)
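For the bolding and vertical centering part of the question, one possibility is tab_style() targeted at cells_row_groups(). The following is a sketch only: the summary data is regenerated with an arbitrary seed, the names stats and tbl are made up, and whether v_align takes effect on the row group column may depend on the installed gt version.

```r
library(dplyr)
library(gt)

set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical stand-in for the two summaries, already stacked
stats <- bind_rows(
  data.frame(cat = rep(c("Blue", "Red", "Green"), 7),
             val = rnorm(21, mean = 112, sd = 12), grp = "color"),
  data.frame(cat = rep(c("2", "4", "6"), 7),
             val = rnorm(21, mean = 60, sd = 12), grp = "day")
) %>%
  group_by(grp, cat) %>%
  summarise(mean = mean(val), sd = sd(val), n = n(), .groups = "drop")

tbl <- stats %>%
  gt(rowname_col = "cat") %>%
  tab_row_group(label = html("[H<sub>2</sub>O]"), rows = grp == "color") %>%
  tab_row_group(label = html("[H<sub>2</sub>O] Addition <br> (Days)"),
                rows = grp == "day") %>%
  cols_hide(grp) %>%
  tab_options(row_group.as_column = TRUE) %>%
  # Bold the row group labels and ask for vertical centering
  tab_style(
    style = cell_text(weight = "bold", v_align = "middle"),
    locations = cells_row_groups()
  )
```

Printing tbl in an interactive session renders the table; the styling call can be dropped or adjusted without affecting the row group labels themselves.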
I have a df that needs to be grouped by multiple columns to subsequently calculate ratios for subset of different columns and the row-wise means and standard deviations.
grouper1 grouper2 condition value
foo baz A 1
foo baz B 2
foo oof A 1
foo oof C 3
bar zab B 2
bar zab C 4
Based on this elegant answer I have managed to build a generic solution:
library(dplyr)
library(tidyr)
library(purrr)
library(stringr)
crossing(c("A"), c("B","C")) %>%
pmap(~ query %>%
group_by(grouper1, grouper2) %>%
summarise(!! str_c('ratio_', ..1, ..2) :=
value[condition == ..1]/value[condition == ..2])) %>%
reduce(full_join, by = c('grouper1', 'grouper2')) %>%
ungroup() %>% mutate(mean = rowMeans(select(., -c(grouper1, grouper2))), SD = unlist(pmap(select(., -c(grouper1, grouper2)), ~ sd(c(...)))))
This works well if all the values in the condition column are found in all groups. If this is not the case, e.g. A is not present in the second grouper1 group (bar) in the above example, I receive the following error:
Error: Column ratio_AC must be length 1 (a summary value), not 0
I could obviously preselect the values for crossing, but this would require a filter on the df and I would lose generality. I would thus like a solution that simply ignores the missing combinations and still calculates the metrics.
One possible solution would be pivot_wider, but I have not managed to implement a working solution for calculating the ratios with it.
We could reshape to wide format with pivot_wider, so that a condition missing from a group becomes NA instead of a zero-length vector, and then use that dataset:
library(dplyr)
library(tidyr)
library(purrr)
library(stringr)
df1 <- df %>%
pivot_wider(names_from = condition, values_from = value)
crossing(v1 = c("A"), v2 = c("B","C")) %>%
pmap(~ df1 %>%
transmute(grouper1, grouper2,
!! str_c('ratio_', ..1, ..2) :=
.[[..1]]/.[[..2]]))%>%
reduce(full_join, by = c('grouper1', 'grouper2')) %>%
mutate(mean = rowMeans(select(., -grouper1, -grouper2), na.rm = TRUE),
SD= pmap_dbl(select(., -grouper1, -grouper2),
~sd(c(...), na.rm = TRUE)))
data
df <- structure(list(grouper1 = c("foo", "foo", "foo", "foo", "bar",
"bar"), grouper2 = c("baz", "baz", "oof", "oof", "zab", "zab"
), condition = c("A", "B", "A", "C", "B", "C"), value = c(1L,
2L, 1L, 3L, 2L, 4L)), class = "data.frame", row.names = c(NA,
-6L))
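To illustrate why the wide format sidesteps the error, the sketch below compares the two situations on the sample data. The names bar_zab, long_ratio and wide_ratio are made up for this illustration.

```r
library(dplyr)
library(tidyr)

# Sample data from this answer, rebuilt with data.frame()
df <- data.frame(
  grouper1  = c("foo", "foo", "foo", "foo", "bar", "bar"),
  grouper2  = c("baz", "baz", "oof", "oof", "zab", "zab"),
  condition = c("A", "B", "A", "C", "B", "C"),
  value     = c(1L, 2L, 1L, 3L, 2L, 4L)
)

# Long format: the bar/zab group has no condition A, so the subset is
# empty and the ratio has length 0, which summarise() refuses
bar_zab <- filter(df, grouper1 == "bar")
long_ratio <- bar_zab$value[bar_zab$condition == "A"] /
  bar_zab$value[bar_zab$condition == "C"]

# Wide format: the missing cell is NA, so the ratio is NA but still length 3
df1 <- pivot_wider(df, names_from = condition, values_from = value)
wide_ratio <- df1$A / df1$C
```

With na.rm = TRUE in rowMeans() and sd(), those NA ratios are then skipped when the row-wise metrics are computed.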
This is an extension of the question I asked here where I was looking for a way to automate my labeling of subjects into groups based on if their data matched my filter.
Prior to attempting to automate the labeling, this is what I had.
library(tidyverse)
df <- structure(list(Subj_ID = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
Location = c(1, 2, 3, 1, 4, 2, 1, 2, 5)), class = "data.frame",
row.names = c(NA, -9L))
df2 <- df %>%
mutate(group=
if_else(Subj_ID ==1,
"Treatment",
if_else(Subj_ID == 2,
"Control","Withdrawn")))
complete.df <- df2 %>% filter(complete.cases(.))
In my actual data, there are some rows that have NA's and I need to be able to filter for both complete and incomplete cases so I can review the sub-data sets separately if needed.
My new code, which assigns a subject to a group based on whether they have a Location value of 4 or 5, looks like this:
df2 <- df %>%
mutate(group=
if_else(Subj_ID ==1,
"Treatment",
if_else(Subj_ID == 2,
"Control","Withdrawn")))
df3 <- df2 %>% ##this chunk breaks filter(complete.cases(.))
group_by(Subj_ID) %>%
mutate(group2 = case_when(any(Location == 4) | any(Location == 5) ~ "YES", TRUE ~ "NO"))
complete.df <- df3 %>% filter(complete.cases(.))
Once I generate df3 by mutating df2, my filter(complete.cases(.)) subsequently fails.
Yet, if I were to generate df3 by manual recoding, it works! Like so:
df2 <- df %>%
mutate(group=
if_else(Subj_ID ==1,
"Treatment",
if_else(Subj_ID == 2,
"Control","Withdrawn")))
df3 <- df2 %>%
mutate(group2=
if_else(Subj_ID ==2 |
Subj_ID ==3,
"TRUE", "FALSE"))
complete.df <- df3 %>% filter(complete.cases(.))
Thoughts?
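The grouping attribute added by group_by causes the issue: inside a grouped filter, . still refers to the whole data frame, so complete.cases(.) returns one value per row of the full data rather than per row of the current group. It can be solved by ungrouping and then applying the filter. The OP's last code block (manual coding) does not create a grouping attribute, and thus it works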
library(dplyr)
df3 %>%
ungroup %>%
filter(complete.cases(.))
Or, instead of complete.cases in filter, we can use !is.na with filter_all and all_vars, which keeps only the rows with no missing values and does not require removing the grouping attribute
df3 %>%
filter_all(all_vars(!is.na(.)))
The OP mentioned that the last code block works, but it has no grouping attribute. If we create one, it fails too
df3 %>%
group_by(group) %>%
filter(complete.cases(.))
Error: Result must have length 3, not 9
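Two other ways to keep the grouping and still drop incomplete rows are sketched below, assuming a dplyr version that provides if_all() (1.0.4 or later) and tidyr's drop_na(). The tibble toy is made up and contains one NA.

```r
library(dplyr)
library(tidyr)

# Hypothetical grouped data with one missing Location
toy <- tibble(Subj_ID = c(1L, 1L, 2L, 2L), Location = c(1, NA, 4, 2)) %>%
  group_by(Subj_ID)

# if_all() is evaluated per group, so the lengths always match
kept_if_all <- toy %>%
  filter(if_all(everything(), ~ !is.na(.x)))

# drop_na() removes rows containing any NA and preserves the grouping
kept_drop_na <- toy %>%
  drop_na()
```

Both results keep three rows and remain grouped by Subj_ID.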
Here's a simple example of what I'm looking for:
Before:
data.frame(
Name = c("pusheen", "pusheen", "puppy"),
Species = c("feline", "feline", "doggie"),
Activity = c("snacking", "napping", "playing"),
Start = c(1, 2, 3),
End = c(11, 12, 13)
)
After:
data.frame(
Name = c("pusheen", "puppy"),
Species = c("feline", "doggie"),
Activity1 = c("snacking", "playing"),
Start1 = c(1, 3),
End1 = c(11, 13),
Activity2 = c("napping", NA),
Start2 = c(2, NA),
End2 = c(12, NA)
)
How do I do this in R or Excel? Thanks!
This can be done using pivot_wider from the tidyr package.
library(tidyr)
library(dplyr)
library(magrittr)
df <- df %>%
group_by(Name) %>%
mutate(num = row_number()) %>% # Create a counter by group
ungroup() %>%
pivot_wider(
id_cols = c("Name", "Species"),
names_from = num,
values_from = c("Activity", "Start", "End"),
names_sep = "")
If you want the result ordered as in your sample output, we can add an additional select statement. I used str_sub from the stringr package to pull out the last character of each column name, and then sorted the names by it. Because only the last character is inspected, this ordering works as long as no Name has more than nine activities.
library(stringr)
df %>%
select(Name, Species, names(df)[order(str_sub(names(df), -1))])
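As an alternative to reordering afterwards, recent versions of tidyr (1.2.0 or later, as far as I know) accept a names_vary argument in pivot_wider that controls the column order directly. A sketch on the question's data, where the names pets and wide are made up:

```r
library(dplyr)
library(tidyr)

# The "Before" data from the question
pets <- data.frame(
  Name = c("pusheen", "pusheen", "puppy"),
  Species = c("feline", "feline", "doggie"),
  Activity = c("snacking", "napping", "playing"),
  Start = c(1, 2, 3),
  End = c(11, 12, 13)
)

wide <- pets %>%
  group_by(Name) %>%
  mutate(num = row_number()) %>%  # counter within each Name
  ungroup() %>%
  pivot_wider(
    id_cols = c(Name, Species),
    names_from = num,
    values_from = c(Activity, Start, End),
    names_sep = "",
    names_vary = "slowest"  # keep each activity's columns together
  )

names(wide)
```

Since the order comes from the pivot itself rather than from the last character of the column names, it is not affected by two-digit suffixes.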