How to group comma-separated variables in the same column?

How to group comma-separated variables in the same column? - r

Here is my false data:
#> id column
#> 1 blue, red, dog, cat
#> 2 red, blue, dog
#> 3 blue
#> 4 red
#> 5 dog, cat
#> 6 cat
#> 7 red, cat
#> 8 dog
#> 9 cat, red
#> 10 blue, cat
I want to tell R for example that dog and cat = animal and red and blue = colour. I want to basically count the number (and eventually percentage) of animals, colours and both.
#> id column newcolumn
#> 1 blue, red, dog, cat both
#> 2 red, blue, dog both
#> 3 blue colour
#> 4 red colour
#> 5 dog, cat animal
#> 6 cat animal
#> 7 red, cat both
#> 8 dog animal
#> 9 cat, red both
#> 10 blue, cat both
So far I have only been able to total up the number of red, blue, dog and cat by doing the following:
column.string<-paste(df$column, collapse=",")
column.vector<-strsplit(column.string, ",")[[1]]
column.vector.clean<-gsub(" ", "", column.vector)
table(column.vector.clean)
Would be very grateful for help, here is my sample false data:
df <- data.frame(id = c(1:10),
column = c("blue, red, dog, cat", "red, blue, dog", "blue", "red", "dog, cat", "cat", "red, cat", "dog", "cat, red", "blue, cat"))

You can define all possible animals and colours in a vector. Split column on comma and test :
animal <- c('dog', 'cat')
colour <- c('red', 'blue')
df$newcolumn <- sapply(strsplit(df$column, ',\\s*'), function(x) {
x <- x[x != "NA"]
if(!length(x)) return(NA)
if(all(x %in% animal)) 'animal'
else if(all(x %in% colour)) 'colour'
else 'both'
})
df
# id column newcolumn
#1 1 blue, red, dog, cat both
#2 2 red, blue, dog both
#3 3 blue colour
#4 4 red colour
#5 5 dog, cat animal
#6 6 cat animal
#7 7 red, cat both
#8 8 dog animal
#9 9 cat, red both
#10 10 blue, cat both
To calculate the proportion, you can then use prop.table with table :
prop.table(table(df$newcolumn, useNA = "ifany"))
#animal both colour
# 0.3 0.5 0.2
Using dplyr, we can separate the rows on comma, for each id create a newcolumn based on conditions and calculate the proportions.
library(dplyr)
df %>%
tidyr::separate_rows(column, sep = ',\\s*') %>%
group_by(id) %>%
summarise(newcolumn = case_when(all(column %in% animal) ~ 'animal',
all(column %in% colour) ~ 'colour',
TRUE ~ 'both'),
column = toString(column)) %>%
count(newcolumn) %>%
mutate(n = n/sum(n))

Related

Using R/dplyr to filter columns?

I have a simple Q... I have a dataset I need to filter by certain parameters. I was hoping for a solution in R?
Dummy case:
colour age animal
red 10 dog
yellow 5 cat
pink 6 cat
I want to classify this dataset e.g. by:
If colour is 'red' OR 'pink' AND age is <7 AND animal is 'cat' then = category 1.
Else category 2.
Output would be:
colour age animal category
red 10 dog 2
yellow 5 cat 2
pink 6 cat 1
Is there a way to manipulate dplyr to achieve this? I'm a clinician not a bioinformatician so go easy!

I like the case_when function in dplyr to set up more complex selections with mutate.
library(tidyverse)
df <- data.frame(colour = c("red", "yellow", "pink", "red", "pink"),
age = c(10, 5, 6, 12, 10),
animal = c("dog", "cat", "cat", "hamster", "cat"))
df
#> colour age animal
#> 1 red 10 dog
#> 2 yellow 5 cat
#> 3 pink 6 cat
#> 4 red 12 hamster
#> 5 pink 10 cat
df <- mutate(df, category = case_when(
((colour == "red" | colour == "pink") & age < 7 & animal == "cat") ~ 1,
(colour == "yellow" | age != 5 & animal == "dog") ~ 2,
(colour == "pink" | animal == "cat") ~ 3,
(TRUE) ~ 4) )
df
#> colour age animal category
#> 1 red 10 dog 2
#> 2 yellow 5 cat 2
#> 3 pink 6 cat 1
#> 4 red 12 hamster 4
#> 5 pink 10 cat 3
Created on 2021-01-17 by the reprex package (v0.3.0)

You could also manipulate this as :
df$category <- with(df,!(colour %in% c('red', 'pink') & age < 7 & animal == 'cat')) + 1
df
# colour age animal category
#1 red 10 dog 2
#2 yellow 5 cat 2
#3 pink 6 cat 1
And in dplyr :
df %>%
mutate(category = as.integer(!(colour %in% c('red', 'pink') &
age < 7 & animal == 'cat')) + 1)

R variable number of string concatenations within group_by

Let's say I have the following table of houses (or anything) and their colors:
I'm trying to:
group_by(Group)
count rows (I assume with length(unique(ID)),
mutate or summarize into a new row with a count of each color in group, as a string.
Result should be:
So I know step 3 could be done by manually entering every possible combination with something like
df <- df %>%
group_by(Group) %>%
mutate(
Summary = case_when(
all(
sum(count_green) > 0
) ~ paste(length(unique(ID)), " houses, ", count_green, " green")
)
)
but what if I have hundreds of possible combinations? Is there a way to paste into a string and append for each new color/count?

Here is one approach where we count the frequency of 'Group', 'Color' with add_count, unite that with 'Color', then grouped by 'Group', create the 'Summary' column by concatenating the unique elements of 'nColor' with the frequency (n())
library(dplyr)
library(tidyr)
library(stringr)
df %>%
add_count(Group, Color) %>%
unite(nColor, n, Color, sep= ' ', remove = FALSE) %>%
group_by(Group) %>%
mutate(
Summary = str_c(n(), ' houses, ', toString(unique(nColor)))) %>%
select(-nColor)
# Groups: Group [2]
# ID Group Color n Summary
# <int> <chr> <chr> <int> <chr>
#1 1 a Green 2 3 houses, 2 Green, 1 Orange
#2 2 a Green 2 3 houses, 2 Green, 1 Orange
#3 3 a Orange 1 3 houses, 2 Green, 1 Orange
#4 4 b Blue 2 3 houses, 2 Blue, 1 Yellow
#5 5 b Yellow 1 3 houses, 2 Blue, 1 Yellow
#6 6 b Blue 2 3 houses, 2 Blue, 1 Yellow
data
df <- structure(list(ID = 1:6, Group = c("a", "a", "a", "b", "b", "b"
), Color = c("Green", "Green", "Orange", "Blue", "Yellow", "Blue"
)), class = "data.frame", row.names = c(NA, -6L))

Here's an approach with map_chr from purrr and a lot of pasting.
library(dplyr)
library(purrr)
df %>%
group_by(Group) %>%
mutate(Summary = paste(n(),"houses,",
paste(map_chr(unique(as.character(Color)),
~paste(sum(Color == .x),.x)),
collapse = ", ")))
## A tibble: 6 x 4
## Groups: Group [2]
# ID Group Color Summary
# <int> <fct> <fct> <chr>
#1 1 a Green 3 houses, 2 Green, 1 Orange
#2 2 a Green 3 houses, 2 Green, 1 Orange
#3 3 a Orange 3 houses, 2 Green, 1 Orange
#4 4 b Blue 3 houses, 2 Blue, 1 Yellow
#5 5 b Yellow 3 houses, 2 Blue, 1 Yellow
#6 6 b Blue 3 houses, 2 Blue, 1 Yellow

Filter by combination of (row) pairs

I have a dataframe in a long format and I want to filter pairs based on unique combinations of values. I have a dataset that looks like this:
id <- rep(1:4, each=2)
type <- c("blue", "blue", "red", "yellow", "blue", "red", "red", "yellow")
df <- data.frame(id,type)
df
id type
1 1 blue
2 1 blue
3 2 red
4 2 yellow
5 3 blue
6 3 red
7 4 red
8 4 yellow
Let's say each id is a respondent and type is a combination of treatments. Individual 1 saw two objects, both of them blue; individual 2 saw one red object and a yellow one; and so on.
How do I keep, for example, those that saw the combination "red" and "yellow"? If I filter by the combination "red" and "yellow" the resulting dataset should look like this:
id type
3 2 red
4 2 yellow
7 4 red
8 4 yellow
It should keep respondents number 2 and number 4 (only those that saw the combination "red" and "yellow"). Note that it does not keep respondent number 3 because she saw "blue" and "red" (instead of "red" and "yellow"). How do I do this?
One solution is to reshape the dataset into a wide format, filter it by column, and restack again. But I am sure there is another way to do it without reshaping the dataset. Any idea?

A dplyr solution would be:
library(dplyr)
df <- data_frame(
id = rep(1:4, each = 2),
type = c("blue", "blue", "red", "yellow", "blue", "red", "red", "yellow")
)
types <- c("red", "yellow")
df %>%
group_by(id) %>%
filter(all(types %in% type))
#> # A tibble: 4 x 2
#> # Groups: id [2]
#> id type
#> <int> <chr>
#> 1 2 red
#> 2 2 yellow
#> 3 4 red
#> 4 4 yellow
Update
Allowing for the equal combinations, e.g. blue, blue, we have to change the filter-call to the following:
types2 <- c("blue", "blue")
df %>%
group_by(id) %>%
filter(sum(types2 == type) == length(types2))
#> # A tibble: 2 x 2
#> # Groups: id [1]
#> id type
#> <int> <chr>
#> 1 1 blue
#> 2 1 blue
This solution also allows different types
df %>%
group_by(id) %>%
filter(sum(types == type) == length(types))
#> # A tibble: 4 x 2
#> # Groups: id [2]
#> id type
#> <int> <chr>
#> 1 2 red
#> 2 2 yellow
#> 3 4 red
#> 4 4 yellow

Let's use all() to see if all rows within group match a set of values.
library(tidyverse)
test_filter <- c("red", "yellow")
df %>%
group_by(id) %>%
filter(all(test_filter %in% type))
# A tibble: 4 x 2
# Groups: id [2]
id type
<int> <fctr>
1 2 red
2 2 yellow
3 4 red
4 4 yellow

I modified your data and did the following.
df <- data.frame(id = rep(1:4, each=3),
type <- c("blue", "blue", "green", "red", "yellow", "purple",
"blue", "orange", "yellow", "yellow", "pink", "red"),
stringsAsFactors = FALSE)
id type
1 1 blue
2 1 blue
3 1 green
4 2 red
5 2 yellow
6 2 purple
7 3 blue
8 3 orange
9 3 yellow
10 4 yellow
11 4 pink
12 4 red
As you see, there are three observations for each id. id 2 and 4 have both red and yellow. They also have non-target colors (i.e., purple, and pink). I wanted to preserve these observations. In order to achieve this task, I wrote the following code. The code can be read like this. "For each id, check if there is any red and yellow using any(). When both conditions are TRUE, keep all rows for the id."
group_by(df, id) %>%
filter(any(type == "yellow") & any(type == "red"))
id type
4 2 red
5 2 yellow
6 2 purple
10 4 yellow
11 4 pink
12 4 red

Using data.table:
library(data.table)
setDT(df)
df[, type1 := shift(type, type = "lag"), by = id]
df1 <- df[type == "yellow" & type1 == "red", id]
df <- df[id %in% df1, ]
df[, type1 := NULL]
It gives:
id type
1: 2 red
2: 2 yellow
3: 4 red
4: 4 yellow

Determining most/least amount of occurrences within subset row & column group in a data frame

I am trying to find the most and least amount of items within a row / column group in a larger data frame. Here is the data to make it clearer:
df <- data.frame(matrix(nrow = 8, ncol = 3))
df$X1 <- c(1, 1, 1, 2, 2, 3, 3, 3)
df$X2 <- c("yellow", "green", "yellow", "blue", NA, "orange", NA, "orange")
df$X3 <- c("green", "yellow", NA, "blue", "red", "purple" , "orange", NA)
names(df) <- c("group", "A", "B")
Here is what that looks like (I have NAs in the original data, so I've included them):
group A B
1 1 yellow green
2 1 green yellow
3 1 yellow <NA>
4 2 blue blue
5 2 <NA> red
6 3 orange purple
7 3 <NA> orange
8 3 orange <NA>
In the first "group", for instance, I want to determine which color occurs the most and which color occurs the least. Something that looks like this:
group A B most least
1 1 yellow green yellow green
2 1 green yellow yellow green
3 1 yellow <NA> yellow green
4 2 blue blue blue red
5 2 <NA> red blue red
6 3 orange purple orange purple
7 3 <NA> orange orange purple
8 3 orange <NA> orange purple
I am working within a dplyr chain in the original data so I can group_by "group", but I am having a hard time figuring out a method that allows me to work within a "cluster" of two columns with differing numbers of rows. I do not need this to be done with dplyr, but I figured it might be easiest given the usefulness of group_by. Additionally, I need the result to somehow remain in the original data frame as new columns. Any suggestions?

A solution uses dplyr and tidyr. The strategy is to find the "most" and "least" item and prepare a new data frame. After that, use the right_join to merge the original data frame and prepare the desired output.
Notice that during the process I used slice to subset the data frame to get the most and least item. This guarantees that there will be only one "most" and one "least" for each group. Nevertheless, it is possible that there could be a tie for each group. If that happens, you may want to think about what could be a good rule to determine which one is the "most" or which one is the "least".
library(dplyr)
library(tidyr)
df2 <- df %>%
gather(Column, Value, -group, na.rm = TRUE) %>%
count(group, Value) %>%
arrange(group, desc(n)) %>%
group_by(group) %>%
slice(c(1, n())) %>%
mutate(Type = c("most", "least")) %>%
select(-n) %>%
spread(Type, Value) %>%
right_join(df, by = "group") %>%
select(c(colnames(df), "most", "least"))
df2
# A tibble: 8 x 5
group A B most least
<dbl> <chr> <chr> <chr> <chr>
1 1 yellow green yellow green
2 1 green yellow yellow green
3 1 yellow <NA> yellow green
4 2 blue blue blue red
5 2 <NA> red blue red
6 3 orange purple orange purple
7 3 <NA> orange orange purple
8 3 orange <NA> orange purple

Two options:
Reshape to long form and use summarise (or count) to aggregate, subsetting the which.max/which.min:
library(tidyverse)
df <- data_frame(group = c(1, 1, 1, 2, 2, 3, 3, 3),
A = c("yellow", "green", "yellow", "blue", NA, "orange", NA, "orange"),
B = c("green", "yellow", NA, "blue", "red", "purple" , "orange", NA))
df %>%
gather(var, color, A:B) %>%
drop_na(color) %>%
group_by(group, color) %>%
summarise(n = n()) %>%
summarise(most = color[which.max(n)],
least = color[which.min(n)]) %>%
left_join(df, .)
#> Joining, by = "group"
#> # A tibble: 8 x 5
#> group A B most least
#> <dbl> <chr> <chr> <chr> <chr>
#> 1 1 yellow green yellow green
#> 2 1 green yellow yellow green
#> 3 1 yellow <NA> yellow green
#> 4 2 blue blue blue red
#> 5 2 <NA> red blue red
#> 6 3 orange purple orange purple
#> 7 3 <NA> orange orange purple
#> 8 3 orange <NA> orange purple
Sort a table of values and subset it:
df %>%
group_by(group) %>%
mutate(most = last(names(sort(table(c(A, B))))),
least = first(names(sort(table(c(A, B))))))
#> # A tibble: 8 x 5
#> # Groups: group [3]
#> group A B most least
#> <dbl> <chr> <chr> <chr> <chr>
#> 1 1 yellow green yellow green
#> 2 1 green yellow yellow green
#> 3 1 yellow <NA> yellow green
#> 4 2 blue blue blue red
#> 5 2 <NA> red blue red
#> 6 3 orange purple orange purple
#> 7 3 <NA> orange orange purple
#> 8 3 orange <NA> orange purple

Renaming factor variables if a condition is satisfied in separate column

I have a df that looks like this:
df <- data.frame(
A = sample(c("Dog", "Cat", "Cat", "Dog", "Fish", "Fish")),
B = sample(c("Brown", "Black", "Brown", "Black", "Brown", "Black")))
df
A B
1 Dog Brown
2 Cat Black
3 Cat Brown
4 Dog Black
5 Fish Brown
6 Fish Black
I want to rename (with dplyr, preferably) the factor variable "Fish" to "Dog" whenever the condition "Brown" is satisfied in the second column.
A B
1 Dog Brown
2 Cat Black
3 Cat Brown
4 Dog Black
5 Dog Brown #rename here
6 Fish Black

You could use replace:
df %>% mutate(A = replace(A, which(A == "Fish" & B == "Brown"), "Dog"))
# A B
#1 Cat Black
#2 Fish Black
#3 Dog Black
#4 Dog Brown
#5 Cat Brown
#6 Dog Brown
And here's a data.table version:
library(data.table)
setDT(df)[A == "Fish" & B == "Brown", A := "Dog"]

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

How to group comma-separated variables in the same column? - r

Related

Using R/dplyr to filter columns?

R variable number of string concatenations within group_by

Filter by combination of (row) pairs

Determining most/least amount of occurrences within subset row & column group in a data frame

Renaming factor variables if a condition is satisfied in separate column

Categories

Resources