I have a simple question: I have a dataset I need to filter by certain parameters, and I was hoping for a solution in R.
Dummy case:
colour age animal
red 10 dog
yellow 5 cat
pink 6 cat
I want to classify this dataset e.g. by:
If colour is ('red' OR 'pink') AND age is <7 AND animal is 'cat', then category 1.
Else category 2.
Output would be:
colour age animal category
red 10 dog 2
yellow 5 cat 2
pink 6 cat 1
Is there a way to manipulate dplyr to achieve this? I'm a clinician not a bioinformatician so go easy!
I like the case_when function in dplyr to set up more complex selections with mutate.
library(tidyverse)
df <- data.frame(colour = c("red", "yellow", "pink", "red", "pink"),
age = c(10, 5, 6, 12, 10),
animal = c("dog", "cat", "cat", "hamster", "cat"))
df
#> colour age animal
#> 1 red 10 dog
#> 2 yellow 5 cat
#> 3 pink 6 cat
#> 4 red 12 hamster
#> 5 pink 10 cat
df <- mutate(df, category = case_when(
((colour == "red" | colour == "pink") & age < 7 & animal == "cat") ~ 1,
(colour == "yellow" | age != 5 & animal == "dog") ~ 2,
(colour == "pink" | animal == "cat") ~ 3,
(TRUE) ~ 4) )
df
#> colour age animal category
#> 1 red 10 dog 2
#> 2 yellow 5 cat 2
#> 3 pink 6 cat 1
#> 4 red 12 hamster 4
#> 5 pink 10 cat 3
Created on 2021-01-17 by the reprex package (v0.3.0)
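For the two-category rule in the question itself, a single if_else() may be enough. The following is a minimal sketch, assuming the three-row dummy data from the question; the column name category follows the question's expected output.

```r
library(dplyr)

# Dummy data from the question
df <- data.frame(colour = c("red", "yellow", "pink"),
                 age = c(10, 5, 6),
                 animal = c("dog", "cat", "cat"),
                 stringsAsFactors = FALSE)

# Category 1 when all three conditions hold, otherwise category 2
df <- df %>%
  mutate(category = if_else(colour %in% c("red", "pink") & age < 7 & animal == "cat",
                            1L, 2L))
df$category
```

case_when() becomes the better fit once there are more than two categories, because conditions are checked in order and the first match wins.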
You could also compute this in base R, since a logical vector coerces to 0/1:
df$category <- with(df,!(colour %in% c('red', 'pink') & age < 7 & animal == 'cat')) + 1
df
# colour age animal category
#1 red 10 dog 2
#2 yellow 5 cat 2
#3 pink 6 cat 1
And in dplyr :
df %>%
mutate(category = as.integer(!(colour %in% c('red', 'pink') &
age < 7 & animal == 'cat')) + 1)
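The trick above relies on logical values coercing to 0 and 1. A small sketch of that step in isolation, using a made-up vector cond in place of the real condition, and showing why the parentheses matter:

```r
cond <- c(FALSE, FALSE, TRUE)        # TRUE where the rule is met (illustrative values)

as.integer(cond)                     # logical coerces to 0/1
category <- as.integer(!cond) + 1L   # rule met -> 1, otherwise -> 2

# Pitfall: ! has lower precedence than +, so this is read as !(cond + 1).
# cond + 1 is never zero, so every element comes out FALSE.
wrong <- !cond + 1
```

This is why the base R version wraps the negation inside with(...) before adding 1, and the dplyr version wraps it in as.integer().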
Here is my fake data:
#> id column
#> 1 blue, red, dog, cat
#> 2 red, blue, dog
#> 3 blue
#> 4 red
#> 5 dog, cat
#> 6 cat
#> 7 red, cat
#> 8 dog
#> 9 cat, red
#> 10 blue, cat
I want to tell R for example that dog and cat = animal and red and blue = colour. I want to basically count the number (and eventually percentage) of animals, colours and both.
#> id column newcolumn
#> 1 blue, red, dog, cat both
#> 2 red, blue, dog both
#> 3 blue colour
#> 4 red colour
#> 5 dog, cat animal
#> 6 cat animal
#> 7 red, cat both
#> 8 dog animal
#> 9 cat, red both
#> 10 blue, cat both
So far I have only been able to total up the number of red, blue, dog and cat by doing the following:
column.string<-paste(df$column, collapse=",")
column.vector<-strsplit(column.string, ",")[[1]]
column.vector.clean<-gsub(" ", "", column.vector)
table(column.vector.clean)
I would be very grateful for help. Here is my sample fake data:
df <- data.frame(id = c(1:10),
column = c("blue, red, dog, cat", "red, blue, dog", "blue", "red", "dog, cat", "cat", "red, cat", "dog", "cat, red", "blue, cat"))
You can define all possible animals and colours in vectors, then split column on commas and test each row:
animal <- c('dog', 'cat')
colour <- c('red', 'blue')
df$newcolumn <- sapply(strsplit(df$column, ',\\s*'), function(x) {
x <- x[x != "NA"]
if(!length(x)) return(NA)
if(all(x %in% animal)) 'animal'
else if(all(x %in% colour)) 'colour'
else 'both'
})
df
# id column newcolumn
#1 1 blue, red, dog, cat both
#2 2 red, blue, dog both
#3 3 blue colour
#4 4 red colour
#5 5 dog, cat animal
#6 6 cat animal
#7 7 red, cat both
#8 8 dog animal
#9 9 cat, red both
#10 10 blue, cat both
To calculate the proportion, you can then use prop.table with table :
prop.table(table(df$newcolumn, useNA = "ifany"))
#animal both colour
# 0.3 0.5 0.2
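Since the question asks for both counts and percentages, the two tables can be put side by side. A minimal sketch, assuming newcolumn holds the ten labels shown in the output above; summary_tbl is just an illustrative name:

```r
# Labels as produced for the ten example rows
newcolumn <- c("both", "both", "colour", "colour", "animal",
               "animal", "both", "animal", "both", "both")

counts  <- table(newcolumn, useNA = "ifany")    # absolute counts per label
percent <- round(100 * prop.table(counts), 1)   # share of rows, in percent

# Combine into one small data frame for reporting
summary_tbl <- data.frame(category = names(counts),
                          n = as.vector(counts),
                          percent = as.vector(percent))
summary_tbl
```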
Using dplyr, we can separate the rows on comma, for each id create a newcolumn based on conditions and calculate the proportions.
library(dplyr)
df %>%
tidyr::separate_rows(column, sep = ',\\s*') %>%
group_by(id) %>%
summarise(newcolumn = case_when(all(column %in% animal) ~ 'animal',
all(column %in% colour) ~ 'colour',
TRUE ~ 'both'),
column = toString(column)) %>%
count(newcolumn) %>%
mutate(n = n/sum(n))
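Note that both versions above label a row 'both' whenever it is neither all animals nor all colours, which would include a row containing an unknown word. If that matters, a stricter variant can test for the presence of each kind explicitly. This is only a sketch; classify is a hypothetical helper name and "fish" is an invented unknown value:

```r
animal <- c("dog", "cat")
colour <- c("red", "blue")

# Hypothetical helper: label one comma-separated string
classify <- function(s) {
  x <- trimws(strsplit(s, ",")[[1]])   # split on commas, drop surrounding spaces
  has_animal <- any(x %in% animal)
  has_colour <- any(x %in% colour)
  if (has_animal && has_colour) "both"
  else if (has_animal) "animal"
  else if (has_colour) "colour"
  else NA_character_                   # nothing recognised
}

labels <- vapply(c("blue, red, dog, cat", "blue", "dog, cat", "fish"),
                 classify, character(1), USE.NAMES = FALSE)
labels
```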
Let's say I have the following table of houses (or anything) and their colors:
I'm trying to:
1. group_by(Group)
2. count rows (I assume with length(unique(ID))),
3. mutate or summarize into a new column with a count of each color in the group, as a string.
Result should be:
So I know step 3 could be done by manually entering every possible combination with something like
df <- df %>%
group_by(Group) %>%
mutate(
Summary = case_when(
all(
sum(count_green) > 0
) ~ paste(length(unique(ID)), " houses, ", count_green, " green")
)
)
but what if I have hundreds of possible combinations? Is there a way to paste into a string and append for each new color/count?
Here is one approach where we count the frequency of 'Group', 'Color' with add_count, unite that with 'Color', then grouped by 'Group', create the 'Summary' column by concatenating the unique elements of 'nColor' with the frequency (n())
library(dplyr)
library(tidyr)
library(stringr)
df %>%
add_count(Group, Color) %>%
unite(nColor, n, Color, sep= ' ', remove = FALSE) %>%
group_by(Group) %>%
mutate(
Summary = str_c(n(), ' houses, ', toString(unique(nColor)))) %>%
select(-nColor)
# Groups: Group [2]
# ID Group Color n Summary
# <int> <chr> <chr> <int> <chr>
#1 1 a Green 2 3 houses, 2 Green, 1 Orange
#2 2 a Green 2 3 houses, 2 Green, 1 Orange
#3 3 a Orange 1 3 houses, 2 Green, 1 Orange
#4 4 b Blue 2 3 houses, 2 Blue, 1 Yellow
#5 5 b Yellow 1 3 houses, 2 Blue, 1 Yellow
#6 6 b Blue 2 3 houses, 2 Blue, 1 Yellow
data
df <- structure(list(ID = 1:6, Group = c("a", "a", "a", "b", "b", "b"
), Color = c("Green", "Green", "Orange", "Blue", "Yellow", "Blue"
)), class = "data.frame", row.names = c(NA, -6L))
Here's an approach with map_chr from purrr and a lot of pasting.
library(dplyr)
library(purrr)
df %>%
group_by(Group) %>%
mutate(Summary = paste(n(),"houses,",
paste(map_chr(unique(as.character(Color)),
~paste(sum(Color == .x),.x)),
collapse = ", ")))
## A tibble: 6 x 4
## Groups: Group [2]
# ID Group Color Summary
# <int> <fct> <fct> <chr>
#1 1 a Green 3 houses, 2 Green, 1 Orange
#2 2 a Green 3 houses, 2 Green, 1 Orange
#3 3 a Orange 3 houses, 2 Green, 1 Orange
#4 4 b Blue 3 houses, 2 Blue, 1 Yellow
#5 5 b Yellow 3 houses, 2 Blue, 1 Yellow
#6 6 b Blue 3 houses, 2 Blue, 1 Yellow
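If one row per group is enough, rather than the summary repeated on every row, count() followed by summarise() may be a shorter route. A sketch under that assumption, using the same example data; group_summary is an illustrative name:

```r
library(dplyr)

df <- data.frame(ID = 1:6,
                 Group = c("a", "a", "a", "b", "b", "b"),
                 Color = c("Green", "Green", "Orange", "Blue", "Yellow", "Blue"),
                 stringsAsFactors = FALSE)

group_summary <- df %>%
  count(Group, Color) %>%       # one row per Group/Color pair, with count n
  group_by(Group) %>%
  summarise(Summary = paste0(sum(n), " houses, ",
                             paste(n, Color, collapse = ", ")))
group_summary
```

Because count() sorts its output, colours appear alphabetically within each group here, not in order of first appearance.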
I am currently working with a large data set that records daily data at multiple locations and I would like to summarize the daily data to have one output giving the maximum warning level on that day (categories red/yellow/none).
Consider the following set up:
location = c(rep("A", 4), rep("B", 4), rep("C", 4), rep("D",4) , rep("E", 4))
date = rep(c("19991230", "19991231", "20000101", "20000102"), 5)
warning = c("Red", "None", "None", "None", "Yellow", "None", "Red", "None", "Yellow", "Yellow", "None", "Yellow", "None", "None", "None", "None", "Yellow", "None", "None", "None")
data = data.frame(location, date, warning)
I am trying to create a new column that shows "None" if no warnings occur on a given day and "Yellow" if one or more yellow warnings occur. If one or more "Red" warnings occur that same day, "Red" takes priority.
I have considered using aggregate by date but I am unsure which function to apply. I have also tried for loops over each date to try and !count "None" warnings to at least narrow it down but without any luck. Perhaps I need to use ifelse and a for loop over the dates? Poor attempts below:
aggregate(data, by=date, FUN)
or
data <- data %>%
group_by(date) %>%
mutate(day_warning_type = case_when(
warning != "None" ~ TRUE, TRUE ~ FALSE
)) %>%
ungroup()
Hopefully someone can at least help me in the right direction as I haven't made much progress so far as I am struggling to know how to work with character variables.
You were on the right track with the group_by. It's maybe simpler to create a second dataset that summarizes by date, and then merge this back into the main dataset. See below
# Summarize each date based on number of Yellow/Red/None warnings
data_sum <- data %>%
group_by(date) %>%
summarize(
day_warning_none = length(which(warning == "None")),
day_warning_yellow = length(which(warning == "Yellow")),
day_warning_red = length(which(warning == "Red"))
) %>%
ungroup() %>%
# Create a summary measure
mutate(
day_warning = case_when(
day_warning_red > 0 ~ "Red",
day_warning_yellow > 0 ~ "Yellow",
TRUE ~ "None"
)
)
head(data_sum)
date day_warning_none day_warning_yellow day_warning_red day_warning
<fct> <int> <int> <int> <chr>
1 19991230 1 3 1 Red
2 19991231 4 1 0 Yellow
3 20000101 4 0 1 Red
4 20000102 4 1 0 Yellow
# Merge back in
data2 <- left_join(data, data_sum, by = "date") %>%
arrange(date)
head(data2, 10)
location date warning day_warning_none day_warning_yellow day_warning_red day_warning
1 A 19991230 Red 1 3 1 Red
2 B 19991230 Yellow 1 3 1 Red
3 C 19991230 Yellow 1 3 1 Red
4 D 19991230 None 1 3 1 Red
5 E 19991230 Yellow 1 3 1 Red
6 A 19991231 None 4 1 0 Yellow
7 B 19991231 None 4 1 0 Yellow
8 C 19991231 Yellow 4 1 0 Yellow
9 D 19991231 None 4 1 0 Yellow
10 E 19991231 None 4 1 0 Yellow
You can count the warnings within each date and create the flag based on the counts:
data %>%
group_by(date) %>%
mutate(day_warning_type = case_when(
sum(warning == "Red") > 0 ~ "Red",
sum(warning == "Red") == 0 & sum(warning == "Yellow") > 0 ~ "Yellow",
TRUE ~ "None"
)) %>%
ungroup()
# A tibble: 20 x 4
location date warning day_warning_type
<fct> <fct> <fct> <chr>
1 A 19991230 Red Red
2 A 19991231 None Yellow
3 A 20000101 None Red
4 A 20000102 None Yellow
5 B 19991230 Yellow Red
6 B 19991231 None Yellow
7 B 20000101 Red Red
8 B 20000102 None Yellow
9 C 19991230 Yellow Red
10 C 19991231 Yellow Yellow
11 C 20000101 None Red
12 C 20000102 Yellow Yellow
13 D 19991230 None Red
14 D 19991231 None Yellow
15 D 20000101 None Red
16 D 20000102 None Yellow
17 E 19991230 Yellow Red
18 E 19991231 None Yellow
19 E 20000101 None Red
20 E 20000102 None Yellow
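Since the question mentions aggregate, here is a base R sketch of the same idea. It assumes the severity order None < Yellow < Red and ranks each warning by its position in that vector; severity and worst are illustrative names, not part of the original code.

```r
location <- c(rep("A", 4), rep("B", 4), rep("C", 4), rep("D", 4), rep("E", 4))
date <- rep(c("19991230", "19991231", "20000101", "20000102"), 5)
warning <- c("Red", "None", "None", "None", "Yellow", "None", "Red", "None",
             "Yellow", "Yellow", "None", "Yellow", "None", "None", "None", "None",
             "Yellow", "None", "None", "None")
data <- data.frame(location, date, warning, stringsAsFactors = FALSE)

severity <- c("None", "Yellow", "Red")                    # assumed order, lowest first
worst <- function(w) severity[max(match(w, severity))]    # highest-ranked level present

# One row per date with the worst warning of that day
day_max <- aggregate(warning ~ date, data = data, FUN = worst)
names(day_max)[2] <- "day_warning"

# Attach the daily value to every original row
data2 <- merge(data, day_max, by = "date")
```

Ranking with match() avoids writing one condition per level, so adding a further level only means extending severity.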
I have a dataframe in a long format and I want to filter pairs based on unique combinations of values. I have a dataset that looks like this:
id <- rep(1:4, each=2)
type <- c("blue", "blue", "red", "yellow", "blue", "red", "red", "yellow")
df <- data.frame(id,type)
df
id type
1 1 blue
2 1 blue
3 2 red
4 2 yellow
5 3 blue
6 3 red
7 4 red
8 4 yellow
Let's say each id is a respondent and type is a combination of treatments. Individual 1 saw two objects, both of them blue; individual 2 saw one red object and a yellow one; and so on.
How do I keep, for example, those that saw the combination "red" and "yellow"? If I filter by the combination "red" and "yellow" the resulting dataset should look like this:
id type
3 2 red
4 2 yellow
7 4 red
8 4 yellow
It should keep respondents number 2 and number 4 (only those that saw the combination "red" and "yellow"). Note that it does not keep respondent number 3 because she saw "blue" and "red" (instead of "red" and "yellow"). How do I do this?
One solution is to reshape the dataset into a wide format, filter it by column, and restack again. But I am sure there is another way to do it without reshaping the dataset. Any idea?
A dplyr solution would be:
library(dplyr)
df <- tibble(
id = rep(1:4, each = 2),
type = c("blue", "blue", "red", "yellow", "blue", "red", "red", "yellow")
)
types <- c("red", "yellow")
df %>%
group_by(id) %>%
filter(all(types %in% type))
#> # A tibble: 4 x 2
#> # Groups: id [2]
#> id type
#> <int> <chr>
#> 1 2 red
#> 2 2 yellow
#> 3 4 red
#> 4 4 yellow
Update
To allow for repeated values in the combination, e.g. blue, blue, we have to change the filter call to the following:
types2 <- c("blue", "blue")
df %>%
group_by(id) %>%
filter(sum(types2 == type) == length(types2))
#> # A tibble: 2 x 2
#> # Groups: id [1]
#> id type
#> <int> <chr>
#> 1 1 blue
#> 2 1 blue
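This solution also works for different types, as long as each group has exactly as many rows as types and lists them in the same order, because == compares element by element: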
df %>%
group_by(id) %>%
filter(sum(types == type) == length(types))
#> # A tibble: 4 x 2
#> # Groups: id [2]
#> id type
#> <int> <chr>
#> 1 2 red
#> 2 2 yellow
#> 3 4 red
#> 4 4 yellow
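The sum(types == type) comparison is positional, so a respondent who saw yellow first and red second would not match c("red", "yellow"). One order-independent alternative is to sort both sides before comparing. This is a sketch, not part of the original answer; exact_combo is a hypothetical helper, and respondent 2's rows are deliberately reversed to show the effect:

```r
library(dplyr)

# Same shape as the question's data, but id 2 lists yellow before red
df <- data.frame(id = rep(1:4, each = 2),
                 type = c("blue", "blue", "yellow", "red",
                          "blue", "red", "red", "yellow"),
                 stringsAsFactors = FALSE)

# TRUE when the group holds exactly the wanted values, in any order,
# with repeats counted (so c("blue", "blue") needs two blues)
exact_combo <- function(type, wanted) identical(sort(type), sort(wanted))

red_yellow <- df %>%
  group_by(id) %>%
  filter(exact_combo(type, c("red", "yellow"))) %>%
  ungroup()

blue_blue <- df %>%
  group_by(id) %>%
  filter(exact_combo(type, c("blue", "blue"))) %>%
  ungroup()
```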
Let's use all() with %in% to check that every value in a target set appears within each group.
library(tidyverse)
test_filter <- c("red", "yellow")
df %>%
group_by(id) %>%
filter(all(test_filter %in% type))
# A tibble: 4 x 2
# Groups: id [2]
id type
<int> <fctr>
1 2 red
2 2 yellow
3 4 red
4 4 yellow
I modified your data and did the following.
df <- data.frame(id = rep(1:4, each=3),
type = c("blue", "blue", "green", "red", "yellow", "purple",
"blue", "orange", "yellow", "yellow", "pink", "red"),
stringsAsFactors = FALSE)
id type
1 1 blue
2 1 blue
3 1 green
4 2 red
5 2 yellow
6 2 purple
7 3 blue
8 3 orange
9 3 yellow
10 4 yellow
11 4 pink
12 4 red
As you see, there are three observations for each id. id 2 and 4 have both red and yellow. They also have non-target colors (i.e., purple, and pink). I wanted to preserve these observations. In order to achieve this task, I wrote the following code. The code can be read like this. "For each id, check if there is any red and yellow using any(). When both conditions are TRUE, keep all rows for the id."
group_by(df, id) %>%
filter(any(type == "yellow") & any(type == "red"))
id type
4 2 red
5 2 yellow
6 2 purple
10 4 yellow
11 4 pink
12 4 red
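If the non-target rows (purple, pink) should be dropped while still requiring that the id has both colours, a second condition inside the same filter() may do it. A sketch on the modified data above; wanted and only_targets are illustrative names:

```r
library(dplyr)

df <- data.frame(id = rep(1:4, each = 3),
                 type = c("blue", "blue", "green", "red", "yellow", "purple",
                          "blue", "orange", "yellow", "yellow", "pink", "red"),
                 stringsAsFactors = FALSE)

wanted <- c("red", "yellow")

only_targets <- df %>%
  group_by(id) %>%
  filter(all(wanted %in% type),   # group-level test: id has every wanted colour
         type %in% wanted) %>%    # row-level test: keep only the wanted rows
  ungroup()
only_targets
```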
Using data.table:
library(data.table)
setDT(df)
df[, type1 := shift(type, type = "lag"), by = id]
df1 <- df[type == "yellow" & type1 == "red", id]
df <- df[id %in% df1, ]
df[, type1 := NULL]
It gives:
id type
1: 2 red
2: 2 yellow
3: 4 red
4: 4 yellow
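The shift() version above matches only when red comes directly before yellow within an id. An order-independent data.table variant could test each group as a whole and return its rows via .SD. This is a sketch with id 2's rows reversed on purpose; dt, wanted and res are illustrative names:

```r
library(data.table)

dt <- data.table(id = rep(1:4, each = 2),
                 type = c("blue", "blue", "yellow", "red",
                          "blue", "red", "red", "yellow"))

wanted <- c("red", "yellow")

# For each id, return the group's rows only if every wanted value is present;
# groups that fail the test return NULL and are dropped
res <- dt[, if (all(wanted %in% type)) .SD, by = id]
res
```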
I am trying to find the most and least amount of items within a row / column group in a larger data frame. Here is the data to make it clearer:
df <- data.frame(matrix(nrow = 8, ncol = 3))
df$X1 <- c(1, 1, 1, 2, 2, 3, 3, 3)
df$X2 <- c("yellow", "green", "yellow", "blue", NA, "orange", NA, "orange")
df$X3 <- c("green", "yellow", NA, "blue", "red", "purple" , "orange", NA)
names(df) <- c("group", "A", "B")
Here is what that looks like (I have NAs in the original data, so I've included them):
group A B
1 1 yellow green
2 1 green yellow
3 1 yellow <NA>
4 2 blue blue
5 2 <NA> red
6 3 orange purple
7 3 <NA> orange
8 3 orange <NA>
In the first "group", for instance, I want to determine which color occurs the most and which color occurs the least. Something that looks like this:
group A B most least
1 1 yellow green yellow green
2 1 green yellow yellow green
3 1 yellow <NA> yellow green
4 2 blue blue blue red
5 2 <NA> red blue red
6 3 orange purple orange purple
7 3 <NA> orange orange purple
8 3 orange <NA> orange purple
I am working within a dplyr chain in the original data so I can group_by "group", but I am having a hard time figuring out a method that allows me to work within a "cluster" of two columns with differing numbers of rows. I do not need this to be done with dplyr, but I figured it might be easiest given the usefulness of group_by. Additionally, I need the result to somehow remain in the original data frame as new columns. Any suggestions?
A solution uses dplyr and tidyr. The strategy is to find the "most" and "least" item and prepare a new data frame. After that, use the right_join to merge the original data frame and prepare the desired output.
Notice that during the process I used slice to subset the data frame to get the most and least item. This guarantees that there will be only one "most" and one "least" for each group. Nevertheless, it is possible that there could be a tie for each group. If that happens, you may want to think about what could be a good rule to determine which one is the "most" or which one is the "least".
library(dplyr)
library(tidyr)
df2 <- df %>%
gather(Column, Value, -group, na.rm = TRUE) %>%
count(group, Value) %>%
arrange(group, desc(n)) %>%
group_by(group) %>%
slice(c(1, n())) %>%
mutate(Type = c("most", "least")) %>%
select(-n) %>%
spread(Type, Value) %>%
right_join(df, by = "group") %>%
select(c(colnames(df), "most", "least"))
df2
# A tibble: 8 x 5
group A B most least
<dbl> <chr> <chr> <chr> <chr>
1 1 yellow green yellow green
2 1 green yellow yellow green
3 1 yellow <NA> yellow green
4 2 blue blue blue red
5 2 <NA> red blue red
6 3 orange purple orange purple
7 3 <NA> orange orange purple
8 3 orange <NA> orange purple
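To find out whether ties actually occur before settling on a rule, the counts can be inspected per group. A sketch using pivot_longer() from tidyr; group 4 is a made-up extra row added only to produce a tie, and ties, most_tied and least_tied are illustrative names:

```r
library(dplyr)
library(tidyr)

df <- data.frame(group = c(1, 1, 1, 2, 2, 3, 3, 3, 4),
                 A = c("yellow", "green", "yellow", "blue", NA, "orange", NA, "orange", "red"),
                 B = c("green", "yellow", NA, "blue", "red", "purple", "orange", NA, "blue"),
                 stringsAsFactors = FALSE)   # last row (group 4) is hypothetical

ties <- df %>%
  pivot_longer(c(A, B), names_to = "column", values_to = "value",
               values_drop_na = TRUE) %>%   # stack A and B, dropping NAs
  count(group, value) %>%                   # frequency of each value per group
  group_by(group) %>%
  summarise(most_tied  = sum(n == max(n)) > 1,   # more than one value at the top count
            least_tied = sum(n == min(n)) > 1)   # more than one value at the bottom count
ties
```

Groups flagged TRUE are the ones where slice() would pick a winner arbitrarily.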
Two options:
Reshape to long form and use summarise (or count) to aggregate, subsetting the which.max/which.min:
library(tidyverse)
df <- tibble(group = c(1, 1, 1, 2, 2, 3, 3, 3),
A = c("yellow", "green", "yellow", "blue", NA, "orange", NA, "orange"),
B = c("green", "yellow", NA, "blue", "red", "purple" , "orange", NA))
df %>%
gather(var, color, A:B) %>%
drop_na(color) %>%
group_by(group, color) %>%
summarise(n = n()) %>%
summarise(most = color[which.max(n)],
least = color[which.min(n)]) %>%
left_join(df, .)
#> Joining, by = "group"
#> # A tibble: 8 x 5
#> group A B most least
#> <dbl> <chr> <chr> <chr> <chr>
#> 1 1 yellow green yellow green
#> 2 1 green yellow yellow green
#> 3 1 yellow <NA> yellow green
#> 4 2 blue blue blue red
#> 5 2 <NA> red blue red
#> 6 3 orange purple orange purple
#> 7 3 <NA> orange orange purple
#> 8 3 orange <NA> orange purple
Sort a table of values and subset it:
df %>%
group_by(group) %>%
mutate(most = last(names(sort(table(c(A, B))))),
least = first(names(sort(table(c(A, B))))))
#> # A tibble: 8 x 5
#> # Groups: group [3]
#> group A B most least
#> <dbl> <chr> <chr> <chr> <chr>
#> 1 1 yellow green yellow green
#> 2 1 green yellow yellow green
#> 3 1 yellow <NA> yellow green
#> 4 2 blue blue blue red
#> 5 2 <NA> red blue red
#> 6 3 orange purple orange purple
#> 7 3 <NA> orange orange purple
#> 8 3 orange <NA> orange purple