Split row elements by character and transform into a vector - r

I have a dataframe looking like:
library(tidyverse)
df <- tibble::tribble(
  ~respondent, ~selection,
  1, "Brain/Energy/Sleep",
  2, "Energy/Mood/Sex",
  3, "Detox/Sex/Stress"
)
I want to count the unique elements in each row after splitting on each '/', i.e. transform the selection column into:
selection <- c("selection", "Brain", "Energy", "Sleep", "Energy", "Mood", "Sex", "Detox", "Sex", "Stress")
How can I do this using dplyr?

You can pull the column, split each entry on "/", flatten the result, and prepend "selection":
new_vec <- df %>%
  pull(selection) %>%
  strsplit("/") %>%
  unlist() %>%
  c("selection", .)

new_vec
[1] "selection" "Brain" "Energy" "Sleep" "Energy" "Mood" "Sex"
[8] "Detox" "Sex" "Stress"

Here is another solution that eliminates duplicate elements in each row. I added a duplicate element to the df to illustrate this.
# Your data with a duplicate element
df <- tibble::tribble(
  ~respondent, ~selection,
  1, "Brain/Energy/Sleep/Sleep",
  2, "Energy/Mood/Sex/Energy",
  3, "Detox/Sex/Stress/Detox"
)

# Number of columns expected after splitting each row on "/"
ncols_exp <- 4

# Getting the distinct values per row (respondent)
df %>%
  # Separate each entry in selection as multiple columns
  separate(col = selection,
           into = paste0("Var", 1:ncols_exp),
           sep = "/") %>%
  # Transform data into long format
  pivot_longer(cols = starts_with("Var"),
               names_to = "Var",
               values_to = "Val") %>%
  # Group by "respondent" (each row in the original df)
  group_by(respondent) %>%
  # Get unique elements of the Val column
  distinct(Val) %>%
  # Pull Val column
  pull(Val) %>%
  # Concatenate the unique values with "selection" as the first entry
  c("selection", .)

Related

How to reorder a list of tidygraph objects based on a column in the list in R?

I have a list of tidygraph objects and am trying to reorder the list elements based on a certain criterion. Each element of my list has a column called name. I want to group together the list elements that have identical name columns, and order those groups in descending order of their count (i.e., how many list elements share the same name column). Hopefully my example will explain this more clearly.
To begin, I create some data, turn them into tidygraph objects and put them together in a list:
library(tidygraph)
library(tidyr)
library(dplyr)
library(purrr)
# create some node and edge data for the tbl_graphs
nodes1 <- data.frame(
  name = c("x4", NA, NA),
  val = c(1, 5, 2)
)
nodes2 <- data.frame(
  name = c("x4", "x2", NA, NA, "x1", NA, NA),
  val = c(3, 2, 2, 1, 1, 2, 7)
)
nodes3 <- data.frame(
  name = c("x1", "x2", NA),
  val = c(7, 4, 2)
)
nodes4 <- nodes1
nodes5 <- nodes2
nodes6 <- nodes1

edges <- data.frame(from = c(1, 1), to = c(2, 3))
edges1 <- data.frame(
  from = c(1, 2, 2, 1, 5, 5),
  to = c(2, 3, 4, 5, 6, 7)
)

# create the tbl_graphs
tg_1 <- tbl_graph(nodes = nodes1, edges = edges)
tg_2 <- tbl_graph(nodes = nodes2, edges = edges1)
tg_3 <- tbl_graph(nodes = nodes3, edges = edges)
tg_4 <- tbl_graph(nodes = nodes4, edges = edges)
tg_5 <- tbl_graph(nodes = nodes5, edges = edges1)
tg_6 <- tbl_graph(nodes = nodes6, edges = edges)

# put into list
myList <- list(tg_1, tg_2, tg_3, tg_4, tg_5, tg_6)
So, we can see that there are 6 tidygraph objects in myList.
Examining each element, we can see that three objects have identical name columns (x4, NA, NA), two objects have identical name columns (x4, x2, NA, NA, x1, NA, NA), and one object remains (x1, x2, NA).
Using a little function to get the counts of equal name columns:
# get a count of identical list elements based on `name` col
counts <- lapply(myList, function(x) {
  x %>%
    pull(name) %>%
    paste(collapse = " ")
}) %>%
  unlist(use.names = FALSE) %>%
  as_tibble() %>%
  group_by(value) %>%
  mutate(val = n():1) %>%
  slice(1) %>%
  arrange(-val)
Just for clarity:
> counts
# A tibble: 3 × 2
# Groups:   value [3]
  value                  val
  <chr>                <int>
1 x4 NA NA                 3
2 x4 x2 NA NA x1 NA NA     2
3 x1 x2 NA                 1
I would like to rearrange the order of list elements in myList based on the val column in my counts object.
My desired output would look something like this (which I am just manually reordering):
myList <- list(tg_1, tg_4, tg_6, tg_2, tg_5, tg_3)
Is there a way to automate the reordering of my list based on the count of identical name columns?
UPDATE:
So my attempted solution is to do the following:
ind <- map(myList, function(x) {
  x %>%
    pull(name) %>%
    replace_na("..") %>%
    paste0(collapse = "")
}) %>%
  unlist(use.names = FALSE) %>%
  as_tibble() %>%
  mutate(ids = 1:n()) %>%
  group_by(value) %>%
  mutate(val = n():1) %>%
  arrange(value) %>%
  pull(ids)

# return new list of trees
myListNew <- myList[ind]
The above code builds a key from the name column of each list element, groups the elements by that key, and returns an index vector called ind. I then index my original list with ind to rearrange it.
However, I would still like to find a way to sort the new list by the total count of each identical name key; I haven't figured that out yet.
After hours of testing, I eventually have a working solution.
ind <- map(myList, function(x) {
  x %>%
    pull(name) %>%
    replace_na("..") %>%
    paste0(collapse = "")
}) %>%
  unlist(use.names = FALSE) %>%
  as_tibble() %>%
  mutate(ids = 1:n()) %>%
  group_by(value) %>%
  mutate(val = n():1) %>%
  arrange(value)

ind <- ind %>%
  group_by(value) %>%
  mutate(valrank = min(ids)) %>%
  ungroup() %>%
  arrange(valrank, value, desc(val)) %>%
  pull(ids)

# return new list of trees
myListNew <- myList[ind]
The first block collapses each element's name column into a key and arranges the rows alphabetically by that key. The second block groups by the key, creates another column that ranks each group, rearranges the rows based on that ranking, and pulls the resulting index, which I then use to reorder the original list.
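For what it's worth, the same reordering can also be written more compactly. This is only a sketch built on the objects defined above (myList and the ".." placeholder for NA), counting how often each collapsed name key occurs and ordering by that count in descending order, with ties kept in their original positions:
keys <- map_chr(myList, ~ paste(replace_na(pull(.x, name), ".."), collapse = ""))
# number of list elements sharing each key
key_counts <- ave(seq_along(keys), keys, FUN = length)
# descending count, ties broken by original position
myListNew <- myList[order(-key_counts, seq_along(keys))]
# for the example data this gives the order tg_1, tg_4, tg_6, tg_2, tg_5, tg_3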

Dplyr pipe operation obtaining minimum

When trying to combine R pipe operations with obtaining a minimum, I came across this issue. I would expect this to work but it doesn't. Can anyone explain to me why this is the case, and how to fix it?
df <- data.frame(ID = c(1, 2, 3, 4),
                 Name = c("Name1", "Name1", "Name2", "Name3"),
                 Value = c(10, 14, 13, 1))

df <- df %>%
  filter(grepl("name1", Name, ignore.case = TRUE)) %>%
  min(Value)
Error in function_list[[k]](value) : object 'Value' not found
We can pull the column 'Value' as a vector and get the min
library(dplyr)
df %>%
  filter(grepl("name1", Name, ignore.case = TRUE)) %>%
  pull(Value) %>%
  min()
Or use summarise
df %>%
  filter(grepl("name1", Name, ignore.case = TRUE)) %>%
  summarise(Value = min(Value))
The reason for the error is that the object passed on by %>% is the full dataset; we need to extract the column ourselves with $ or [[:
df %>%
  filter(grepl("name1", Name, ignore.case = TRUE)) %>%
  {min(.$Value)}
#[1] 10
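The same thing with [[, again wrapping the call in braces so the piped data frame is only used where . appears (a minor variant, not from the original answer):
df %>%
  filter(grepl("name1", Name, ignore.case = TRUE)) %>%
  {min(.[["Value"]])}
#[1] 10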

Calculation on every pair from grouped data.frame

My question is about performing a calculation between each pair of groups in a data.frame; I'd like it to be more vectorized.
I have a data.frame that consists of the following columns: Location, Sample, Var1, and Var2. I'd like to find the closest match for each Sample between each pair of Locations for both Var1 and Var2.
I can accomplish this for one pair of locations as follows:
library(dplyr)
library(tidyr)

df0 <- data.frame(Location = rep(c("A", "B", "C"), each = 30),
                  Sample = rep(c(1:30), times = 3),
                  Var1 = sample(1:25, 90, replace = TRUE),
                  Var2 = sample(1:25, 90, replace = TRUE))
df00 <- data.frame(Location = rep(c("A", "B", "C"), each = 30),
                   Sample = rep(c(31:60), times = 3),
                   Var1 = sample(1:100, 90, replace = TRUE),
                   Var2 = sample(1:100, 90, replace = TRUE))
df000 <- rbind(df0, df00)
df <- sample_n(df000, 100)  # data

dfl <- df %>% gather(VAR, value, 3:4)
df1 <- dfl %>% filter(Location == "A")
df2 <- dfl %>% filter(Location == "B")
df3 <- merge(df1, df2, by = c("VAR"), all.x = TRUE, allow.cartesian = TRUE)
df3 <- df3 %>% mutate(DIFF = abs(value.x - value.y))
result <- df3 %>% group_by(VAR, Sample.x) %>% top_n(-1, DIFF)
I tried other possibilities such as using tidyr::spread, but could not avoid the "Error: Duplicate identifiers for rows" or columns half filled with NA.
Is there a more clean and automated way to do this for each possible group pair? I'd like to avoid the manual subset and merge routine for each pair.
One option would be to create the pairwise combinations of 'Location' with combn and then do the other steps as in the OP's code:
library(tidyverse)
df %>%
  # get the unique elements of Location
  distinct(Location) %>%
  # pull the column as a vector
  pull %>%
  # it is a factor, so convert it to character
  as.character %>%
  # get the pairwise combinations in a list
  combn(m = 2, simplify = FALSE) %>%
  # loop through the list with map and do the full_join
  # with the long-format data dfl
  map(~ full_join(dfl %>%
                    filter(Location == first(.x)),
                  dfl %>%
                    filter(Location == last(.x)), by = "VAR") %>%
        # create a column of absolute differences
        mutate(DIFF = abs(value.x - value.y)) %>%
        # grouped by VAR, Sample.x
        group_by(VAR, Sample.x) %>%
        # apply top_n with wt as DIFF
        top_n(-1, DIFF))
Also, since the OP asked about picking the Locations automatically instead of doing the double filter (the expected output is not entirely clear), both Locations can be filtered in a single step:
df %>%
  distinct(Location) %>%
  pull %>%
  as.character %>%
  combn(m = 2, simplify = FALSE) %>%
  map(~ dfl %>%
        # change here, i.e. filter both Locations at once
        filter(Location %in% .x) %>%
        # spread it to wide format
        spread(Location, value, fill = 0) %>%
        # create the DIFF column by taking the difference
        mutate(DIFF = abs(!! rlang::sym(first(.x)) -
                          !! rlang::sym(last(.x)))) %>%
        group_by(VAR, Sample) %>%
        top_n(-1, DIFF))
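As a small usage note (a sketch, not part of the original answer): the same combn output can also be used to label each element of the resulting list with its Location pair, which makes the results easier to inspect:
pairs <- df %>%
  distinct(Location) %>%
  pull() %>%
  as.character() %>%
  combn(m = 2, simplify = FALSE)

# labels such as "A_B", "A_C", "B_C"; attach them to the list
# returned by map() above with purrr::set_names()
pair_labels <- map_chr(pairs, paste, collapse = "_")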

Removing groups with a certain number of NAs

Sorry to bother with a relatively simple question perhaps.
I have this type of dataframe:
It has a long list of names in the column "NAME" (a, b, c, d, e, ...), two possible classes in the column "SURNAME" (A, B), and a third column "VALUE" containing values.
I want to remove every NAME for which at least one of the SURNAME classes has more than 2 NAs in the VALUE column.
I wanted to post an example dataset, but I am struggling to format it properly.
I was trying to use
df <- df %>%
  group_by(NAME) %>%
  group_by(SURNAME) %>%
  filter(!is.na(VALUE)) %>%
  filter(length(VALUE) >= 3)
It does not throw an error, but I have the impression that something is wrong. Any suggestions? Many thanks.
Let's create a dataset to work with:
set.seed(1234)
df <- data.frame(
  name = sample(x = letters, size = 1e3, replace = TRUE),
  surname = sample(x = c("A", "B"), size = 1e3, replace = TRUE),
  value = sample(x = c(1:10 * 10, NA), size = 1e3, replace = TRUE),
  stringsAsFactors = FALSE
)
Here's how to do it with Base R:
# count NAs by name-surname combos (na.action arg is important!)
agg <- aggregate(value ~ name + surname, data=df, FUN=function(x) sum(is.na(x)), na.action=NULL)
# rename the count-of-NAs column
names(agg)[3] <- "number_of_na"
#add count of NAs back to original data
df <- merge(df, agg, by=c("name", "surname"))
# subset the original data
result <- df[df$number_of_na < 3, ]
Here's how to do it with data.table:
library(data.table)
dt <- as.data.table(df)
dt[ , number_of_na := sum(is.na(value)), by=.(name, surname)]
result <- dt[number_of_na < 3]
Here's how to do it with dplyr/tidyverse:
library(dplyr) # or library(tidyverse)
result <- df %>%
  group_by(name, surname) %>%
  summarize(number_of_na = sum(is.na(value))) %>%
  right_join(df, by = c("name", "surname")) %>%
  filter(number_of_na < 3)
After grouping by 'NAME' and 'SURNAME', create a column with the number of NA elements in that group, and then filter out any 'NAME' that has an 'ind' greater than or equal to 3:
df %>%
  group_by(NAME, SURNAME) %>%
  mutate(ind = sum(is.na(VALUE))) %>%
  group_by(NAME) %>%
  filter(!any(ind >= 3)) %>%
  select(-ind)
Or do an anti_join after filtering by 'NAME' and 'SURNAME' based on the condition:
df %>%
  group_by(NAME, SURNAME) %>%
  filter(sum(is.na(VALUE)) >= 3) %>%
  ungroup %>%
  distinct(NAME) %>%
  anti_join(df, .)
data
set.seed(24)
df <- data.frame(NAME = rep(letters[1:5], each = 20),
                 SURNAME = sample(LETTERS[1:4], 5 * 20, replace = TRUE),
                 VALUE = sample(c(NA, 1:3), 5 * 20, replace = TRUE),
                 stringsAsFactors = FALSE)
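As a quick sanity check (a sketch using the data block above), you can list which NAMEs have a SURNAME group with 3 or more NAs, i.e. the groups the filters above are meant to drop:
df %>%
  group_by(NAME, SURNAME) %>%
  summarise(n_na = sum(is.na(VALUE))) %>%
  ungroup() %>%
  filter(n_na >= 3) %>%
  distinct(NAME)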

Sample groups and preserve row order

I have a dataframe such as:
df <- data.frame(id = factor(c(12321, 12321, 12321, 4445, 4445, 4445, 4445, 787, 787, 787)),
                 word = c("please", "stop", "that", "the", "fox", "jumps", "that", "please", "eat", "noodles"),
                 word_id = c(12, 5, 28, 99, 214, 800, 28, 12, 78, 912))
And I am attempting to take a sample of the data frame while preserving the id group and the word and word_id order.
I tried newDF <- df %>% group_by(id) %>% sample_frac(0.33) but this takes a sample of each group.
I would like to end up with a dataframe that samples whole id groups from the original dataframe and preserves the row order. So if I take a 33% sample of df, I keep 33% of the id groups and their rows remain in order:
newDF <- data.frame(id = factor(c(12321, 12321, 12321, 4445, 4445, 4445, 4445)),
                    word = c("please", "stop", "that", "the", "fox", "jumps", "that"),
                    word_id = c(12, 5, 28, 99, 214, 800, 28))
Adding to alistaire's comment:
library(dplyr)
library(tidyr)

newDF1 <- df %>%
  group_by(id) %>%
  nest() %>%
  sample_frac(1/3) %>%
  unnest()

newDF2 <- anti_join(df, newDF1, by = "id")
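On newer dplyr versions (1.0 or later), a sketch of the same idea without nesting is to sample whole id groups with slice_sample() and then keep their rows, which preserves the original row order (the set.seed() is only there for reproducibility):
set.seed(1)
sampled_ids <- df %>%
  distinct(id) %>%
  slice_sample(prop = 1/3) %>%
  pull(id)

newDF <- df %>% filter(id %in% sampled_ids)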
