Insert column names in dplyr - R

Let's assume I have a data frame with lots of columns: var1, ..., var100, and also a matching named vector with one entry per column.
I would like to replace any NA in the data frame with the value from the named vector for that column. This is what I wrote so far:
data %>%
mutate(var1 = ifelse(is.na(var1), named_vec["var1"], var1),
var2 = ifelse(is.na(var2), named_vec["var2"], var2),
...)
It works; however, with hundreds of variables it would be very impractical to write so many conditions. I then tried this:
data %>%
mutate_if(~ifelse(is.na(.x), named_vec[colnames(.x)], .x))
Error in selected[[i]] <- eval_tidy(.p(column, ...)) :
more elements supplied than there are to replace
However, this does not work. Is there a way in dplyr to extract the column name so I can slice the named vector?
Here a small example of data to try
data <- data.frame(var1 = c(1, 1, NA, 1),
var2 = c(2, NA, NA, 2),
var3 = c(3, 3, 3, NA))
named_vec <- c("var1" = 1, "var2" = 2, "var3" = 3)

It may be easier to do this with coalesce
library(dplyr)
library(purrr)
library(stringr)
nm1 <- str_c('var', 1:3)
data[nm1] <- map_dfc(nm1, ~ coalesce(data[[.x]], named_vec[.x]))
data
# var1 var2 var3
#1 1 2 3
#2 1 2 3
#3 1 2 3
#4 1 2 3
Or if we replicate the 'named_vec',
data[] <- coalesce(as.matrix(data), named_vec[col(data)])
Another option is to convert to 'long' format, then do a left_join, coalesce the 'value' columns, and reshape back to 'wide' format
library(tidyr)
data %>%
mutate(rn = row_number()) %>%
pivot_longer(cols = -rn) %>%
left_join(enframe(named_vec), by = 'name') %>%
transmute(rn, name, value = coalesce(value.x, value.y)) %>%
pivot_wider(names_from = name, values_from = value) %>%
select(-rn)
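For completeness, the same idea can also be written with across() and cur_column(); this is a minimal sketch, assuming dplyr >= 1.0 and that every column of data has a matching entry in named_vec.
library(dplyr)
# cur_column() gives the name of the column currently being mutated,
# which is exactly what is needed to slice the named vector
data %>%
  mutate(across(everything(), ~ coalesce(.x, named_vec[cur_column()])))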

Related

How to reorder a list of tidygraph objects based on a column in the list in R?

I have a list of tidygraph objects and am trying to reorder the list elements based on a certain criterion. Each element of my list has a column called name. I am trying to group together the list elements that have identical name columns, and I would also like to order those groups by descending count (i.e., by how many list elements share the same name column). Hopefully my example will explain more clearly.
To begin, I create some data, turn them into tidygraph objects and put them together in a list:
library(tidygraph)
library(tidyr)
library(dplyr)
library(purrr)
# create some node and edge data for the tbl_graph
nodes1 <- data.frame(
name = c("x4", NA, NA),
val = c(1, 5, 2)
)
nodes2 <- data.frame(
name = c("x4", "x2", NA, NA, "x1", NA, NA),
val = c(3, 2, 2, 1, 1, 2, 7)
)
nodes3 <- data.frame(
name = c("x1", "x2", NA),
val = c(7, 4, 2)
)
nodes4 <- nodes1
nodes5 <- nodes2
nodes6 <- nodes1
edges <- data.frame(from = c(1, 1), to = c(2, 3))
edges1 <- data.frame(
from = c(1, 2, 2, 1, 5, 5),
to = c(2, 3, 4, 5, 6, 7)
)
# create the tbl_graphs
tg_1 <- tbl_graph(nodes = nodes1, edges = edges)
tg_2 <- tbl_graph(nodes = nodes2, edges = edges1)
tg_3 <- tbl_graph(nodes = nodes3, edges = edges)
tg_4 <- tbl_graph(nodes = nodes4, edges = edges)
tg_5 <- tbl_graph(nodes = nodes5, edges = edges1)
tg_6 <- tbl_graph(nodes = nodes6, edges = edges)
# put into list
myList <- list(tg_1, tg_2, tg_3, tg_4, tg_5, tg_6)
So, we can see that there are 6 tidygraph objects in myList.
Examining each element, we can see that 3 objects have identical name columns (i.e., x4, NA, NA), 2 objects have identical name columns ("x4", "x2", NA, NA, "x1", NA, NA), and 1 object remains (x1, x2, NA).
Using a little function to get the counts of equal name columns:
# get a count of identical list elements based on `name` col
counts <- lapply(myList, function(x) {
x %>%
pull(name) %>%
paste0(collapse = "")
}) %>%
unlist(use.names = F) %>%
as_tibble() %>%
group_by(value) %>%
mutate(val = n():1) %>%
slice(1) %>%
arrange(-val)
Just for clarity:
> counts
# A tibble: 3 × 2
# Groups: value [3]
value val
<chr> <int>
1 x4 NA NA 3
2 x4 x2 NA NA x1 NA NA 2
3 x1 x2 NA 1
I would like to rearrange the order of list elements in myList based on the val column in my counts object.
My desired output would look something like this (which I am just manually reordering):
myList <- list(tg_1, tg_4, tg_6, tg_2, tg_5, tg_3)
Is there a way to automate the reordering of my list based on the count of identical name columns?
UPDATE:
So my attempted solution is to do the following:
ind <- map(myList, function(x){
x %>%
pull(name) %>%
replace_na("..") %>%
paste0(collapse = "")
}) %>%
unlist(use.names = F) %>%
as_tibble() %>%
mutate(ids = 1:n()) %>%
group_by(value) %>%
mutate(val = n():1) %>%
arrange(value) %>%
pull(ids)
# return new list of trees
myListNew <- myList[ind]
The above code groups the list elements by the name column and returns an index called ind. I'm then indexing my original list by the ind index to rearrange my list.
However, I would still like to find a way to sort the new list based on the total count of each identical name column... I still haven't figured that out yet.
After hours of testing, I eventually have a working solution.
ind <- map(myList, function(x){
x %>%
pull(name) %>%
replace_na("..") %>%
paste0(collapse = "")
}) %>%
unlist(use.names = F) %>%
as_tibble() %>%
mutate(ids = 1:n()) %>%
group_by(value) %>%
mutate(val = n():1) %>%
arrange(value)
ind <- ind %>%
group_by(value) %>%
mutate(valrank = min(ids)) %>%
ungroup() %>%
arrange(valrank, value, desc(val)) %>%
pull(ids)
# return new list of trees
myListNew <- myList[ind]
The above code arranges the list by name alphabetically. Then I group by the name and create another column that ranks the row. I can then rearrange the rows based on this variable. Finally I index by the result.
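For comparison, a shorter sketch of the same reordering; it assumes the nodes are the active table of each graph, so pull(name) returns the name column as in the code above.
library(dplyr)
library(purrr)
# collapse each graph's name column into a single key string
keys <- map_chr(myList, ~ paste(pull(.x, name), collapse = "|"))
# count how many graphs share each key, then order by descending count,
# breaking ties by original position; for the example this gives
# tg_1, tg_4, tg_6, tg_2, tg_5, tg_3
counts <- ave(seq_along(keys), keys, FUN = length)
myListNew <- myList[order(-counts, seq_along(keys))]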

Group dataframe row and column wise based on other dataframe?

I have a dataframe that I would like to group in both directions, first row-wise and then column-wise. The first part worked well, but I am stuck on the second one. I would appreciate any help or advice, or a solution that does both steps at the same time.
This is the dataframe:
df1 <- data.frame(
ID = c(rep(1,5),rep(2,5)),
ID2 = rep(c("A","B","C","D","E"),2),
A = rnorm(10,20,1),
B = rnorm(10,50,1),
C = rnorm(10,10,1),
D = rnorm(10,15,1),
E = rnorm(10,5,1)
)
This is the second dataframe, which holds the "recipe" for grouping:
df2 <- data.frame (
Group_1 = c("B","C"),
Group_2 = c("D","A"),
Group_3 = ("E"), stringsAsFactors = FALSE)
Row-wise grouping:
df1_grouped<-bind_cols(df1[1:2], map_df(df2, ~rowSums(df1[unique(.x)])))
Now I would like to apply the same grouping to the ID2 column and sum the values in the other columns. My idea was to mutate another column (e.g. "group") which contains the name of the final group of ID2. After this I can use group_by() and summarise() to calculate the sum for each. However, I can't figure out an automated way to do it:
bind_cols(df1_grouped,
#add group label
data.frame(
group = rep(c("Group_2","Group_1","Group_1","Group_2","Group_3"),2))) %>%
#remove temporary label column and make ID a character column
mutate(ID2=group,
ID=as.character(ID))%>%
select(-group) %>%
#summarise
group_by(ID,ID2)%>%
summarise_if(is.numeric, sum, na.rm = TRUE)
This is the final table I need, but I had to manually assign the groups, which is impossible for big datasets
I can offer the following solution:
library(tidyverse)
set.seed(1)
df1 <- data.frame(
ID = c(rep(1,5),rep(2,5)),
ID2 = rep(c("A","B","C","D","E"),2),
A = rnorm(10,20,1),
B = rnorm(10,50,1),
C = rnorm(10,10,1),
D = rnorm(10,15,1),
E = rnorm(10,5,1)
)
df2 <- data.frame (
Group_1 = c("B","C"),
Group_2 = c("D","A"),
Group_3 = ("E"), stringsAsFactors = FALSE)
df2 <- df2 %>% pivot_longer(everything())
df1 %>%
pivot_longer(-c(ID, ID2)) %>%
mutate(gr_r = df2$name[match(ID2, table = df2$value)],
gr_c = df2$name[match(name, table = df2$value)]) %>%
arrange(ID, gr_r, gr_c) %>%
pivot_wider(c(ID, gr_r), names_from = gr_c, values_from = value, values_fn = list(value = sum))
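As a follow-up on the automation point in the question, the manual group labels can also be derived with a named lookup vector; this is a sketch, reusing df1_grouped from the question and the long-format df2 created in the answer above.
# named lookup built from the long df2, e.g. "B" -> "Group_1", "D" -> "Group_2"
lookup <- setNames(df2$name, df2$value)
df1_grouped %>%
  mutate(ID2 = lookup[as.character(ID2)],
         ID = as.character(ID)) %>%
  group_by(ID, ID2) %>%
  summarise_if(is.numeric, sum, na.rm = TRUE)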

Writing a for loop in R to combine columns that have matching names (with slight variations)

I have a data frame where the column names are duplicated once. Now I need to combine them to get a proper data set. I can use the dplyr select command to extract the matching columns and combine them later. However, I wish to achieve this using a for loop.
#Example data frame
x <- c(1, NA, 3)
y <- c(1, NA, 4)
x.1 <- c(NA, 3, NA)
y.1 <- c(NA, 5, NA)
data <- data.frame(x, y, x.1, y.1)
## with `dplyr` I can do it like this
t1 <- data%>%select(contains("x"))%>%
mutate(x = rowSums(., na.rm = TRUE))%>%
select(x)
t2 <- data%>%select(contains("y"))%>%
mutate(y = rowSums(., na.rm = TRUE))%>%
select(y)
data <- cbind(t1,t2)
This is cumbersome, as I have more than 25 similar columns.
How can I achieve the same result using a for loop that matches column names and performs rowSums? Or an even simpler approach using dplyr would also help.
We can use split.default to split the data into a list based on the substring of the column names, and then apply rowSums to each list element:
library(dplyr)
library(stringr)
library(purrr)
data %>%
split.default(str_remove(names(.), "\\.\\d+")) %>%
map_dfc(rowSums, na.rm = TRUE)
# A tibble: 3 x 2
# x y
# <dbl> <dbl>
#1 1 1
#2 3 5
#3 3 4
If we want to use a for loop
un1 <- unique(sub("\\..*", "", names(data)))
out <- setNames(rep(list(NA), length(un1)), un1)
for(un in un1) {
out[[un]] <- rowSums(data[grep(un, names(data))], na.rm = TRUE)
}
as.data.frame(out)
data
data <- structure(list(x = c(1, NA, 3), y = c(1, NA, 4), x.1 = c(NA,
3, NA), y.1 = c(NA, 5, NA)), class = "data.frame", row.names = c(NA,
-3L))
Using purrr::map_dfc and transmute instead of mutate
library(dplyr)
purrr::map_dfc(c('x','y'), ~data %>% select(contains(.x)) %>%
transmute(!!.x := rowSums(., na.rm = TRUE)))
x y
1 1 1
2 3 5
3 3 4
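For completeness, the same split-and-rowSums idea can be sketched in a single base R call (wrapped in as.data.frame() because sapply() returns a matrix here):
# split the columns by their base name ("x", "y"), then rowSums each block
as.data.frame(sapply(split.default(data, sub("\\..*", "", names(data))),
                     rowSums, na.rm = TRUE))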

Calculation on every pair from grouped data.frame

My question is about performing a calculation between each pair of groups in a data.frame; I'd like it to be more vectorized.
I have a data.frame that consists of the following columns: Location, Sample, Var1, and Var2. I'd like to find the closest match for each Sample for each pair of Locations, for both Var1 and Var2.
I can accomplish this for one pair of locations as follows:
library(dplyr)
library(tidyr)
df0 <- data.frame(Location = rep(c("A", "B", "C"), each =30),
Sample = rep(c(1:30), times =3),
Var1 = sample(1:25, 90, replace =T),
Var2 = sample(1:25, 90, replace=T))
df00 <- data.frame(Location = rep(c("A", "B", "C"), each =30),
Sample = rep(c(31:60), times =3),
Var1 = sample(1:100, 90, replace =T),
Var2 = sample(1:100, 90, replace=T))
df000 <- rbind(df0, df00)
df <- sample_n(df000, 100) # data
dfl <- df %>% gather(VAR, value, 3:4)
df1 <- dfl %>% filter(Location == "A")
df2 <- dfl %>% filter(Location == "B")
df3 <- merge(df1, df2, by = c("VAR"), all.x = TRUE, allow.cartesian=TRUE)
df3 <- df3 %>% mutate(DIFF = abs(value.x-value.y))
result <- df3 %>% group_by(VAR, Sample.x) %>% top_n(-1, DIFF)
I tried other possibilities, such as using tidyr::spread, but could not avoid the "Error: Duplicate identifiers for rows" or columns half filled with NA.
Is there a more clean and automated way to do this for each possible group pair? I'd like to avoid the manual subset and merge routine for each pair.
One option would be to create the pairwise combination of 'Location' with combn and then do the other steps as in the OP's code
library(tidyverse)
df %>%
# get the unique elements of Location
distinct(Location) %>%
# pull the column as a vector
pull %>%
# it is factor, so convert it to character
as.character %>%
# get the pairwise combinations in a list
combn(m = 2, simplify = FALSE) %>%
# loop through the list with map and do the full_join
# with the long format data dfl
map(~ full_join(dfl %>%
filter(Location == first(.x)),
dfl %>%
filter(Location == last(.x)), by = "VAR") %>%
# create a column of absolute difference
mutate(DIFF = abs(value.x - value.y)) %>%
# grouped by VAR, Sample.x
group_by(VAR, Sample.x) %>%
# apply the top_n with wt as DIFF
top_n(-1, DIFF))
Also, as the OP asked about picking up the pairs automatically instead of doing the double filter (the expected output is not entirely clear, though):
df %>%
distinct(Location) %>%
pull %>%
as.character %>%
combn(m = 2, simplify = FALSE) %>%
map(~ dfl %>%
# change here i.e. filter both the Locations
filter(Location %in% .x) %>%
# spread it to wide format
spread(Location, value, fill = 0) %>%
# create the DIFF column by taking the differene
mutate(DIFF = abs(!! rlang::sym(first(.x)) -
!! rlang::sym(last(.x)))) %>%
group_by(VAR, Sample) %>%
top_n(-1, DIFF))
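A small follow-up sketch on the first pipeline: naming each list element by its Location pair makes the results easier to inspect. This assumes the long-format dfl from the question and uses slice_min() in place of the superseded top_n(); newer dplyr versions may warn about the many-to-many join, which is expected here.
library(dplyr)
library(purrr)
pairs <- df %>%
  distinct(Location) %>%
  pull() %>%
  as.character() %>%
  combn(m = 2, simplify = FALSE)
results <- pairs %>%
  # name each element like "A_B", "A_C", "B_C"
  set_names(map_chr(pairs, paste, collapse = "_")) %>%
  map(~ inner_join(filter(dfl, Location == .x[1]),
                   filter(dfl, Location == .x[2]), by = "VAR") %>%
        mutate(DIFF = abs(value.x - value.y)) %>%
        group_by(VAR, Sample.x) %>%
        slice_min(DIFF, n = 1, with_ties = FALSE) %>%
        ungroup())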

Removing groups with a certain number of NAs

Sorry to bother you with what is perhaps a relatively simple question.
I have this type of dataframe:
A long list of names in the column "NAME" c(a, b, c, d, e, ...), two potential classes in the column "SURNAME" c(A, B), and a third column "VALUE" containing values.
I want to remove all NAMES for which at least in one of the SURNAME classes I have more than 2 "NA" in the VALUE column.
I wanted to post an example dataset but I am struggling to format it properly
I was trying to use
df <- df %>%
group_by(NAME) %>%
group_by(SURNAME) %>%
filter(!is.na(VALUE)) %>%
filter(length(VALUE)>=3)
It does not throw an error, but I have the impression that something is wrong. Any suggestions? Many thanks.
Let's create a dataset to work with:
set.seed(1234)
df <- data.frame(
name = sample(x=letters, size=1e3, replace=TRUE),
surname = sample(x=c("A", "B"), size=1e3, replace=TRUE),
value = sample(x=c(1:10*10,NA), size=1e3, replace=TRUE),
stringsAsFactors = FALSE
)
Here's how to do it with Base R:
# count NAs by name-surname combos (na.action arg is important!)
agg <- aggregate(value ~ name + surname, data=df, FUN=function(x) sum(is.na(x)), na.action=NULL)
# rename the count-of-NAs column
names(agg)[3] <- "number_of_na"
#add count of NAs back to original data
df <- merge(df, agg, by=c("name", "surname"))
# subset the original data
result <- df[df$number_of_na < 3, ]
Here's how to do it with data.table:
library(data.table)
dt <- as.data.table(df)
dt[ , number_of_na := sum(is.na(value)), by=.(name, surname)]
result <- dt[number_of_na < 3]
Here's how to do it with dplyr/tidyverse:
library(dplyr) # or library(tidyverse)
result <- df %>%
group_by(name, surname) %>%
summarize(number_of_na = sum(is.na(value))) %>%
right_join(df, by=c("name", "surname")) %>%
filter(number_of_na < 3)
After grouping by 'NAME' and 'SURNAME', create a column with the number of NA elements in that group, and then filter out any 'NAME' that has an 'ind' greater than or equal to 3:
df %>%
group_by(NAME, SURNAME) %>%
mutate(ind = sum(is.na(VALUE))) %>%
group_by(NAME) %>%
filter(!any(ind >=3)) %>%
select(-ind)
Or do an anti_join after filtering by 'NAME' and 'SURNAME' based on the condition:
df %>%
group_by(NAME, SURNAME) %>%
filter(sum(is.na(VALUE))>=3) %>%
ungroup %>%
distinct(NAME) %>%
anti_join(df, .)
data
set.seed(24)
df <- data.frame(NAME = rep(letters[1:5], each = 20),
SURNAME = sample(LETTERS[1:4], 5 * 20, replace = TRUE),
VALUE = sample(c(NA, 1:3), 5 *20, replace = TRUE),
stringsAsFactors = FALSE)
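For reference, the same per-NAME filter can be sketched as a single grouped condition (assuming the df generated just above with set.seed(24)):
library(dplyr)
df %>%
  group_by(NAME) %>%
  # keep a NAME only if every SURNAME class has fewer than 3 NAs in VALUE
  filter(all(tapply(is.na(VALUE), SURNAME, sum) < 3)) %>%
  ungroup()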
