"Result must have length..." error with 'complete.cases' - r

This is an extension of the question I asked here where I was looking for a way to automate my labeling of subjects into groups based on if their data matched my filter.
Prior to attempting to automate the labeling, this is what I had.
library(tidyverse)
df <- structure(list(Subj_ID = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
Location = c(1, 2, 3, 1, 4, 2, 1, 2, 5)), class = "data.frame",
row.names = c(NA, -9L))
df2 <- df %>%
mutate(group=
if_else(Subj_ID ==1,
"Treatment",
if_else(Subj_ID == 2,
"Control","Withdrawn")))
complete.df <- df2 %>% filter(complete.cases(.))
In my actual data, there are some rows that have NA's and I need to be able to filter for both complete and incomplete cases so I can review the sub-data sets separately if needed.
My new code looks like this which assigns a subject to a group based on if they have a location data point 4 or 5:
df2 <- df %>%
mutate(group=
if_else(Subj_ID ==1,
"Treatment",
if_else(Subj_ID == 2,
"Control","Withdrawn")))
df3 <- df2 %>% ##this chunk breaks filter(complete.cases(.))
group_by(Subj_ID) %>%
mutate(group2 = case_when(any(Location == 4) | any(Location == 5) ~ "YES", TRUE ~ "NO"))
complete.df <- df3 %>% filter(complete.cases(.))
Once I generate df3 by mutating df2, my filter(complete.cases(.)) subsequently fails.
Yet if I generate df3 by manual recoding, it works! Like so:
df2 <- df %>%
mutate(group=
if_else(Subj_ID ==1,
"Treatment",
if_else(Subj_ID == 2,
"Control","Withdrawn")))
df3 <- df2 %>%
mutate(group2=
if_else(Subj_ID ==2 |
Subj_ID ==3,
"TRUE", "FALSE"))
complete.df <- df3 %>% filter(complete.cases(.))
Thoughts?

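The grouping attribute added by group_by is what causes the issue. Inside a grouped filter, the condition must have one value per row of the current group (3 rows here), but complete.cases(.) is evaluated on the whole data frame (9 rows). Ungrouping before applying the filter solves it. The OP's last code block (manual recoding) never creates a grouping attribute, which is why it works.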
library(dplyr)
df3 %>%
ungroup %>%
filter(complete.cases(.))
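Or, instead of complete.cases in filter, we can use !is.na with filter_all, which keeps the grouping attribute. Note that all_vars (rather than any_vars) is needed to match complete.cases, because every column has to be non-missing
df3 %>%
filter_all(all_vars(!is.na(.)))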
The OP mentioned that the last code block works, but it has no grouping attribute. If we add one, it fails too
df3 %>%
group_by(group) %>%
filter(complete.cases(.))
Error: Result must have length 3, not 9
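As a minimal sketch of the fix described above: the data below is the OP's toy data with one NA added purely for illustration, and if_else with %in% is swapped in for the OP's case_when. Ungrouping right after the per-group step lets both the complete and the incomplete subsets be filtered. tidyr::drop_na() is shown as an alternative that does not depend on the grouping.

```r
library(dplyr)
library(tidyr)

# OP's toy data; the NA in Location is an assumption added for illustration
df <- data.frame(Subj_ID = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
                 Location = c(1, 2, 3, 1, 4, NA, 1, 2, 5))

df3 <- df %>%
  group_by(Subj_ID) %>%
  # flag subjects that have at least one Location of 4 or 5
  mutate(group2 = if_else(any(Location %in% c(4, 5)), "YES", "NO")) %>%
  ungroup()  # drop the grouping once the per-group step is done

# complete.cases(.) now sees the same 9 rows that filter() works on
complete.df   <- df3 %>% filter(complete.cases(.))
incomplete.df <- df3 %>% filter(!complete.cases(.))

# drop_na() gives the complete rows whether or not the data is grouped
complete.df2 <- df %>% group_by(Subj_ID) %>% drop_na()
```

Here complete.df and incomplete.df together partition df3, so the two subsets can be reviewed separately.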

Related

Set last N values of dataframe to NA in R

I am trying to group my dataframe and set the last N values of a column in each group to NA. I can do it for N = 1 like so:
df %>% group_by(ID) %>% mutate(target = c(target[-n()], NA))
But I am struggling to get it to work for any N.
This is my current attempt:
df %>% group_by(ID) %>% mutate(target = c(target[1:(abs(n()-1))], NA))
But this seems to fail for groups of size 1
I also tried:
df %>% group_by(ID) %>% mutate(target = ifelse(n()==1, target, c(target[1:(abs(n()-1))], NA)))
But the else clause never takes effect.
Any advice would be appreciated, thanks.
We could use
library(dplyr)
df %>%
group_by(ID) %>%
mutate(target = replace(target, tail(row_number(), N), NA))
You can use case_when() for a vectorized solution in dplyr.
library(dplyr)
df %>%
group_by(ID) %>%
mutate(target = case_when(row_number() <= n() - N ~ target))
My thanks to @akrun for pointing out that case_when() defaults to an NA of the proper type, so case_when() automatically fills the last N with NAs.
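To check that the replace() and case_when() approaches agree, including for a group with fewer than N rows, here is a small sketch. The data frame and the value N <- 2 are made up for illustration and are not from the question.

```r
library(dplyr)

N <- 2  # illustrative value
df <- data.frame(ID = c("a", "a", "a", "a", "b"), target = 1:5)

# replace() with the positions of the last N rows of each group
out_replace <- df %>%
  group_by(ID) %>%
  mutate(target = replace(target, tail(row_number(), N), NA)) %>%
  ungroup()

# case_when() keeps the first n() - N rows and leaves the rest as NA
out_case <- df %>%
  group_by(ID) %>%
  mutate(target = case_when(row_number() <= n() - N ~ target)) %>%
  ungroup()

out_replace$target
identical(out_replace$target, out_case$target)
```

Group "b" has a single row, so with N = 2 its only value becomes NA under both approaches.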
Update
The solution by @akrun is more performant. Benchmarked at times = 50 repetitions apiece
library(microbenchmark)
big_df <- tibble(ID = rep(letters, 100000)) %>%
mutate(target = row_number()) %>%
group_by(ID)
microbenchmark(
times = 50,
Greg = {
big_df %>%
mutate(target = case_when(row_number() <= n() - N ~ target))
},
akrun = {
big_df %>%
mutate(target = replace(target, tail(row_number(), N), NA))
}
)
it is about 35% faster than mine at scale (2,600,000 rows):
Unit: milliseconds
expr min lq mean median uq max neval
Greg 82.6337 90.9669 128.93278 96.35760 213.3593 258.8570 50
akrun 52.4519 55.8314 63.40997 61.43945 64.1082 196.4069 50
Here is an alternative suggestion:
We could define the top N after using arrange in descending order (with -x), apply our ifelse statement and rearrange:
library(dplyr)
N = 2
df %>%
group_by(id) %>%
arrange(-x, .by_group = TRUE) %>%
mutate(x = ifelse(row_number() <= N, NA, x)) %>% # <= avoids recycling 1:N against the group
arrange(x, .by_group = TRUE)
df <- structure(list(x = c(2, 4, 6, 1, 2, 5, 6, 7, 3, 4, 5, 6), id = c(1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -12L))

Multiple ratios by column-wise division with dplyr following grouping

I have a df that needs to be grouped by multiple columns to subsequently calculate ratios for subset of different columns and the row-wise means and standard deviations.
grouper1 grouper2 condition value
foo baz A 1
foo baz B 2
foo oof A 1
foo oof C 3
bar zab B 2
bar zab C 4
Based on this elegant answer I have managed to build a generic solution:
library(dplyr)
library(tidyr)
library(purrr)
library(stringr)
crossing(c("A"), c("B","C")) %>%
pmap(~ query %>%
group_by(grouper1, grouper2) %>%
summarise(!! str_c('ratio_', ..1, ..2) :=
value[condition == ..1]/value[condition == ..2])) %>%
reduce(full_join, by = c('grouper1', 'grouper2')) %>%
ungroup() %>%
mutate(mean = rowMeans(select(., -c(grouper1, grouper2))),
SD = unlist(pmap(select(., -c(grouper1, grouper2)), ~ sd(c(...)))))
This works well if all the values in the condition column are found in all groups. If this is not the case, e.g. A is not present in the second grouping using grouper1 in the above example, I will receive the following error:
Error: Column ratio_AC must be length 1 (a summary value), not 0
I could obviously preselect the values for crossing, but this would require a filter on the df and I would lose generality. I would thus like a solution that simply ignores the missing combinations and still calculates the metrics.
One possible solution would be pivot_wider, but I could not get a working solution for calculating the ratios that way.
We could reshape to wide format with pivot_wider and then use that dataset
library(dplyr)
library(tidyr)
library(purrr)
library(stringr)
df1 <- df %>%
pivot_wider(names_from = condition, values_from = value)
crossing(v1 = c("A"), v2 = c("B","C")) %>%
pmap(~ df1 %>%
transmute(grouper1, grouper2,
!! str_c('ratio_', ..1, ..2) :=
.[[..1]]/.[[..2]]))%>%
reduce(full_join, by = c('grouper1', 'grouper2')) %>%
mutate(mean = rowMeans(select(., -grouper1, -grouper2), na.rm = TRUE),
SD= pmap_dbl(select(., -grouper1, -grouper2),
~sd(c(...), na.rm = TRUE)))
data
df <- structure(list(grouper1 = c("foo", "foo", "foo", "foo", "bar",
"bar"), grouper2 = c("baz", "baz", "oof", "oof", "zab", "zab"
), condition = c("A", "B", "A", "C", "B", "C"), value = c(1L,
2L, 1L, 3L, 2L, 4L)), class = "data.frame", row.names = c(NA,
-6L))
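To see why the wide format avoids the "must be length 1" error, here is a reduced sketch with the two ratios written out by hand instead of generated with crossing and pmap. In long format a missing condition yields a zero-length vector, whereas after pivot_wider it becomes an NA cell, so the division returns NA for that group.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  grouper1  = c("foo", "foo", "foo", "foo", "bar", "bar"),
  grouper2  = c("baz", "baz", "oof", "oof", "zab", "zab"),
  condition = c("A", "B", "A", "C", "B", "C"),
  value     = c(1L, 2L, 1L, 3L, 2L, 4L)
)

# Long format: bar/zab has no condition "A", so this subset is empty
missing_A <- with(df, value[grouper2 == "zab" & condition == "A"])
length(missing_A)

# Wide format: the missing combination is an NA cell instead
wide <- df %>%
  pivot_wider(names_from = condition, values_from = value)

# Ratios written out explicitly for illustration
ratios <- wide %>%
  mutate(ratio_AB = A / B, ratio_AC = A / C)
ratios
```

The na.rm = TRUE arguments in the answer's rowMeans and sd calls then skip those NA ratios.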

insert column names in dplyr

Let's assume I have a data frame with lots of columns, var1, ..., var100, and a named vector of the same length whose names match the columns.
I would like to create a function that fills any NA in the data frame with the value from the named vector for that column. This is what I have written so far:
data %>%
mutate(var1 = ifelse(is.na(var1), named_vec["var1"], var1),
var2 = ifelse(is.na(var2), named_vec["var2"], var2),
...)
It works; however, with hundreds of variables it would be very impractical to write so many conditions. I then tried this:
data %>%
mutate_if(~ifelse(is.na(.x), named_vec[colnames(.x)], .x))
Error in selected[[i]] <- eval_tidy(.p(column, ...)) :
more elements supplied than there are to replace
However, this does not work. Is there a way in dplyr to extract the column name so I can index into the named vector?
Here is a small example of data to try:
data <- data.frame(var1 = c(1, 1, NA, 1),
var2 = c(2, NA, NA, 2),
var3 = c(3, 3, 3, NA))
named_vec <- c("var1" = 1, "var2" = 2, "var3" = 3)
It may be easier to do this with coalesce
library(dplyr)
library(purrr)
library(stringr)
nm1 <- str_c('var', 1:3)
data[nm1] <- map_dfc(nm1, ~ coalesce(data[[.x]], named_vec[.x]))
data
# var1 var2 var3
#1 1 2 3
#2 1 2 3
#3 1 2 3
#4 1 2 3
Or if we replicate the 'named_vec',
data[] <- coalesce(as.matrix(data), named_vec[col(data)])
Another option is to convert to 'long' format, then do a left_join, coalesce the 'value' columns, and reshape back to 'wide' format
library(tidyr)
library(tibble) # enframe() comes from tibble
data %>%
mutate(rn = row_number()) %>%
pivot_longer(cols = -rn) %>%
left_join(enframe(named_vec), by = 'name') %>%
transmute(rn, name, value = coalesce(value.x, value.y)) %>%
pivot_wider(names_from = name, values_from = value) %>%
select(-rn)
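The question asked how to get at the column name inside dplyr. As a sketch, assuming dplyr 1.0.0 or later (which none of the answers above rely on), across() combined with cur_column() does this directly:

```r
library(dplyr)

data <- data.frame(var1 = c(1, 1, NA, 1),
                   var2 = c(2, NA, NA, 2),
                   var3 = c(3, 3, 3, NA))
named_vec <- c(var1 = 1, var2 = 2, var3 = 3)

filled <- data %>%
  mutate(across(all_of(names(named_vec)),
                # cur_column() is the name of the column being processed
                ~ coalesce(.x, named_vec[[cur_column()]])))
filled
```

Using all_of(names(named_vec)) restricts the replacement to columns that actually have an entry in the named vector.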

Combine rows with duplicate identifiers while adding additional columns

Here's a simple example of what I'm looking for:
Before:
data.frame(
Name = c("pusheen", "pusheen", "puppy"),
Species = c("feline", "feline", "doggie"),
Activity = c("snacking", "napping", "playing"),
Start = c(1, 2, 3),
End = c(11, 12, 13)
)
After:
data.frame(
Name = c("pusheen", "puppy"),
Species = c("feline", "doggie"),
Activity1 = c("snacking", "playing"),
Start1 = c(1, 3),
End1 = c(11, 13),
Activity2 = c("napping", NA),
Start2 = c(2, NA),
End2 = c(12, NA)
)
How do I do this in R or Excel? Thanks!
This can be done using pivot_wider from the tidyr package.
library(tidyr)
library(dplyr)
library(magrittr)
df <- df %>%
group_by(Name) %>%
mutate(num = row_number()) %>% # Create a counter by group
ungroup() %>%
pivot_wider(
id_cols = c("Name", "Species"),
names_from = num,
values_from = c("Activity", "Start", "End"),
names_sep = "")
If you want the result ordered as in your sample output, we can add an additional select statement. I used str_sub from the stringr package to pull out the last character from each column name, and then sorted the names from there. Because only the last character is compared, this method of ordering columns holds for up to nine activities per name.
library(stringr)
df %>%
select(Name, Species, names(df)[order(str_sub(names(df), -1))])
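As an alternative sketch for the column ordering, assuming tidyr 1.2.0 or later: pivot_wider has a names_vary argument, and names_vary = "slowest" groups the output columns by activity number, so no separate select step is needed. The input is the "Before" data frame from the question.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  Name     = c("pusheen", "pusheen", "puppy"),
  Species  = c("feline", "feline", "doggie"),
  Activity = c("snacking", "napping", "playing"),
  Start    = c(1, 2, 3),
  End      = c(11, 12, 13)
)

wide <- df %>%
  group_by(Name) %>%
  mutate(num = row_number()) %>%  # counter within each Name
  ungroup() %>%
  pivot_wider(
    id_cols     = c(Name, Species),
    names_from  = num,
    values_from = c(Activity, Start, End),
    names_sep   = "",
    names_vary  = "slowest"  # Activity1, Start1, End1, Activity2, ...
  )
names(wide)
```

Since the ordering comes from the pivot itself rather than from the last character of each name, it is not limited to single-digit counters.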

Calculation on every pair from grouped data.frame

My question is about performing a calculation between each pair of groups in a data.frame; I'd like it to be more vectorized.
I have a data.frame that consists of the following columns: Location, Sample, Var1, and Var2. I'd like to find the closest match for each Sample for each pair of Locations, for both Var1 and Var2.
I can accomplish this for one pair of locations as such:
df0 <- data.frame(Location = rep(c("A", "B", "C"), each =30),
Sample = rep(c(1:30), times =3),
Var1 = sample(1:25, 90, replace =T),
Var2 = sample(1:25, 90, replace=T))
df00 <- data.frame(Location = rep(c("A", "B", "C"), each =30),
Sample = rep(c(31:60), times =3),
Var1 = sample(1:100, 90, replace =T),
Var2 = sample(1:100, 90, replace=T))
df000 <- rbind(df0, df00)
df <- sample_n(df000, 100) # data
dfl <- df %>% gather(VAR, value, 3:4)
df1 <- dfl %>% filter(Location == "A")
df2 <- dfl %>% filter(Location == "B")
df3 <- merge(df1, df2, by = c("VAR"), all.x = TRUE, allow.cartesian=TRUE)
df3 <- df3 %>% mutate(DIFF = abs(value.x-value.y))
result <- df3 %>% group_by(VAR, Sample.x) %>% top_n(-1, DIFF)
I tried other possibilities such as using dplyr::spread but could not avoid the "Error: Duplicate identifiers for rows" or columns half filled with NA.
Is there a cleaner and more automated way to do this for each possible group pair? I'd like to avoid the manual subset and merge routine for each pair.
One option would be to create the pairwise combinations of 'Location' with combn and then do the other steps as in the OP's code. Here df1 stands for the long-format data (the OP's dfl), as the code comments indicate, not the OP's Location "A" subset
library(tidyverse)
df %>%
# get the unique elements of Location
distinct(Location) %>%
# pull the column as a vector
pull %>%
# it is factor, so convert it to character
as.character %>%
# get the pairwise combinations in a list
combn(m = 2, simplify = FALSE) %>%
# loop through the list with map and do the full_join
# with the long format data df1
map(~ full_join(df1 %>%
filter(Location == first(.x)),
df1 %>%
filter(Location == last(.x)), by = "VAR") %>%
# create a column of absolute difference
mutate(DIFF = abs(value.x - value.y)) %>%
# grouped by VAR, Sample.x
group_by(VAR, Sample.x) %>%
# apply the top_n with wt as DIFF
top_n(-1, DIFF))
Also, since the OP asked for the pairs to be picked up automatically rather than filtering twice, here is a variant with a single filter (the expected output is not clear, though)
df %>%
distinct(Location) %>%
pull %>%
as.character %>%
combn(m = 2, simplify = FALSE) %>%
map(~ df1 %>%
# change here i.e. filter both the Locations
filter(Location %in% .x) %>%
# spread it to wide format
spread(Location, value, fill = 0) %>%
# create the DIFF column by taking the difference
mutate(DIFF = abs(!! rlang::sym(first(.x)) -
!! rlang::sym(last(.x)))) %>%
group_by(VAR, Sample) %>%
top_n(-1, DIFF))
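For a version that can be run end to end, here is a reduced sketch of the combn approach on a smaller made-up data set. Several things here are substitutions rather than the code above: pivot_longer replaces gather, base merge replaces full_join, slice_min replaces top_n, and the seed, the sizes and the names long, pairs and closest are arbitrary choices for illustration.

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(1)  # arbitrary seed, only for reproducibility
df <- data.frame(Location = rep(c("A", "B", "C"), each = 5),
                 Sample   = rep(1:5, times = 3),
                 Var1     = sample(1:25, 15, replace = TRUE),
                 Var2     = sample(1:25, 15, replace = TRUE))

# long format: one row per Location / Sample / VAR
long <- df %>%
  pivot_longer(c(Var1, Var2), names_to = "VAR", values_to = "value")

# all unordered pairs of locations, named "A_B", "A_C", "B_C"
pairs <- combn(unique(as.character(long$Location)), 2, simplify = FALSE)
pairs <- set_names(pairs, map_chr(pairs, paste, collapse = "_"))

closest <- map(pairs, function(p) {
  # every sample of the first location against every sample of the second
  merge(filter(long, Location == p[1]),
        filter(long, Location == p[2]),
        by = "VAR") %>%
    mutate(DIFF = abs(value.x - value.y)) %>%
    group_by(VAR, Sample.x) %>%
    slice_min(DIFF, n = 1) %>%  # keeps ties, as top_n(-1, DIFF) does
    ungroup()
})
names(closest)
```

The result is a list with one data frame per location pair. bind_rows(closest, .id = "pair") would stack them into a single table if that is more convenient.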

Resources