Let's say I have the following table of houses (or anything) and their colors:
I'm trying to:
1. group_by(Group)
2. count the rows in each group (I assume with length(unique(ID)))
3. mutate or summarize into a new column holding a count of each color in the group, as a string.
Result should be:
So I know step 3 could be done by manually entering every possible combination with something like
df <- df %>%
group_by(Group) %>%
mutate(
Summary = case_when(
all(
sum(count_green) > 0
) ~ paste(length(unique(ID)), " houses, ", count_green, " green")
)
)
but what if I have hundreds of possible combinations? Is there a way to paste into a string and append for each new color/count?
Here is one approach: count the frequency of each 'Group'/'Color' pair with add_count, unite that count with 'Color' into 'nColor', then, grouped by 'Group', build the 'Summary' column by pasting the group size (n()) in front of the unique elements of 'nColor'.
library(dplyr)
library(tidyr)
library(stringr)
df %>%
add_count(Group, Color) %>%
unite(nColor, n, Color, sep= ' ', remove = FALSE) %>%
group_by(Group) %>%
mutate(
Summary = str_c(n(), ' houses, ', toString(unique(nColor)))) %>%
select(-nColor)
# Groups: Group [2]
# ID Group Color n Summary
# <int> <chr> <chr> <int> <chr>
#1 1 a Green 2 3 houses, 2 Green, 1 Orange
#2 2 a Green 2 3 houses, 2 Green, 1 Orange
#3 3 a Orange 1 3 houses, 2 Green, 1 Orange
#4 4 b Blue 2 3 houses, 2 Blue, 1 Yellow
#5 5 b Yellow 1 3 houses, 2 Blue, 1 Yellow
#6 6 b Blue 2 3 houses, 2 Blue, 1 Yellow
data
df <- structure(list(ID = 1:6, Group = c("a", "a", "a", "b", "b", "b"
), Color = c("Green", "Green", "Orange", "Blue", "Yellow", "Blue"
)), class = "data.frame", row.names = c(NA, -6L))
Here's an approach with map_chr from purrr and a lot of pasting.
library(dplyr)
library(purrr)
df %>%
group_by(Group) %>%
mutate(Summary = paste(n(),"houses,",
paste(map_chr(unique(as.character(Color)),
~paste(sum(Color == .x),.x)),
collapse = ", ")))
## A tibble: 6 x 4
## Groups: Group [2]
# ID Group Color Summary
# <int> <fct> <fct> <chr>
#1 1 a Green 3 houses, 2 Green, 1 Orange
#2 2 a Green 3 houses, 2 Green, 1 Orange
#3 3 a Orange 3 houses, 2 Green, 1 Orange
#4 4 b Blue 3 houses, 2 Blue, 1 Yellow
#5 5 b Yellow 3 houses, 2 Blue, 1 Yellow
#6 6 b Blue 3 houses, 2 Blue, 1 Yellow
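If one row per group is enough (the question mentions "mutate or summarize"), the same string can be built with count() followed by summarise(). This is only a minimal sketch, assuming dplyr 1.0 or later for the .groups argument; the name summary_df is made up for illustration, and the data is the sample df from the first answer.

```r
library(dplyr)

# Sample data from the first answer
df <- data.frame(
  ID = 1:6,
  Group = c("a", "a", "a", "b", "b", "b"),
  Color = c("Green", "Green", "Orange", "Blue", "Yellow", "Blue"),
  stringsAsFactors = FALSE
)

summary_df <- df %>%
  count(Group, Color) %>%   # one row per Group/Color pair with its frequency n
  group_by(Group) %>%
  summarise(
    # sum(n) is the number of houses in the group;
    # paste(n, Color) builds "2 Green", "1 Orange", ... before collapsing
    Summary = paste0(sum(n), " houses, ", paste(n, Color, collapse = ", ")),
    .groups = "drop"
  )

summary_df
```

Because count() returns its rows sorted by Group and Color, the colors appear alphabetically within each string rather than in order of first appearance.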
Here is my made-up sample data:
#> id column
#> 1 blue, red, dog, cat
#> 2 red, blue, dog
#> 3 blue
#> 4 red
#> 5 dog, cat
#> 6 cat
#> 7 red, cat
#> 8 dog
#> 9 cat, red
#> 10 blue, cat
I want to tell R, for example, that dog and cat are animals and that red and blue are colours. Basically, I want to count the number (and eventually the percentage) of rows that contain animals, colours, or both.
#> id column newcolumn
#> 1 blue, red, dog, cat both
#> 2 red, blue, dog both
#> 3 blue colour
#> 4 red colour
#> 5 dog, cat animal
#> 6 cat animal
#> 7 red, cat both
#> 8 dog animal
#> 9 cat, red both
#> 10 blue, cat both
So far I have only been able to total up the number of red, blue, dog and cat by doing the following:
column.string<-paste(df$column, collapse=",")
column.vector<-strsplit(column.string, ",")[[1]]
column.vector.clean<-gsub(" ", "", column.vector)
table(column.vector.clean)
I would be very grateful for help. Here is the code that creates the sample data:
df <- data.frame(id = c(1:10),
column = c("blue, red, dog, cat", "red, blue, dog", "blue", "red", "dog, cat", "cat", "red, cat", "dog", "cat, red", "blue, cat"))
You can define all possible animals and colours in vectors, then split column on commas and test each row's items:
animal <- c('dog', 'cat')
colour <- c('red', 'blue')
df$newcolumn <- sapply(strsplit(df$column, ',\\s*'), function(x) {
x <- x[x != "NA"]
if(!length(x)) return(NA)
if(all(x %in% animal)) 'animal'
else if(all(x %in% colour)) 'colour'
else 'both'
})
df
# id column newcolumn
#1 1 blue, red, dog, cat both
#2 2 red, blue, dog both
#3 3 blue colour
#4 4 red colour
#5 5 dog, cat animal
#6 6 cat animal
#7 7 red, cat both
#8 8 dog animal
#9 9 cat, red both
#10 10 blue, cat both
To calculate the proportions, you can then use prop.table with table:
prop.table(table(df$newcolumn, useNA = "ifany"))
#animal both colour
# 0.3 0.5 0.2
Using dplyr, we can separate the rows on comma, for each id create a newcolumn based on conditions and calculate the proportions.
library(dplyr)
df %>%
tidyr::separate_rows(column, sep = ',\\s*') %>%
group_by(id) %>%
summarise(newcolumn = case_when(all(column %in% animal) ~ 'animal',
all(column %in% colour) ~ 'colour',
TRUE ~ 'both'),
column = toString(column)) %>%
count(newcolumn) %>%
mutate(n = n/sum(n))
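With more than two categories, the if/else chain gets long. A hedged alternative sketch keeps the item-to-category mapping in one named vector and labels a row "both" whenever its items map to more than one category. The names lookup and classify are hypothetical, and items missing from lookup map to NA, which this sketch also treats as a mixed row.

```r
df <- data.frame(
  id = 1:10,
  column = c("blue, red, dog, cat", "red, blue, dog", "blue", "red", "dog, cat",
             "cat", "red, cat", "dog", "cat, red", "blue, cat"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table: item -> category
lookup <- c(dog = "animal", cat = "animal", red = "colour", blue = "colour")

classify <- function(x) {
  items <- strsplit(x, ",\\s*")[[1]]      # split one cell on commas
  kinds <- unique(unname(lookup[items]))  # categories present in this row
  if (length(kinds) == 1) kinds else "both"
}

df$newcolumn <- vapply(df$column, classify, character(1), USE.NAMES = FALSE)

# Percentage of rows in each category
pct <- round(100 * prop.table(table(df$newcolumn)), 1)
pct
```

Adding a new item or category then only means adding an entry to lookup.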
This is the short example data. The original data has many more columns and rows.
head(df, 15)
ID col1 col2
1 1 green yellow
2 1 green blue
3 1 green green
4 2 yellow blue
5 2 yellow yellow
6 2 yellow blue
7 3 yellow yellow
8 3 yellow yellow
9 3 yellow blue
10 4 blue yellow
11 4 blue yellow
12 4 blue yellow
13 5 yellow yellow
14 5 yellow blue
15 5 yellow yellow
I want to count how many different colors appear in col2 for each ID, including the color in col1. For example, for ID = 4 there is only 1 color in col2; if we include col1, there are 2 different colors, so the output should be 2, and so on.
I tried it this way, but it doesn't give my desired output: ID = 4 turns into 0, which is not what I want. How can I tell R to count the colors including the one in col1?
out <- df %>%
group_by(ID) %>%
mutate(N = ifelse(col1 != col2, 1, 0))
My desired output is something like this:
ID col1 count
1 green 3
2 yellow 2
3 yellow 2
4 blue 2
5 yellow 2
You can do:
df %>%
group_by(ID, col1) %>%
summarise(count = n_distinct(col2))
ID col1 count
<int> <chr> <int>
1 1 green 3
2 2 yellow 2
3 3 yellow 2
4 4 blue 1
5 5 yellow 2
Or even:
df %>%
group_by(ID, col1) %>%
summarise_all(n_distinct)
ID col1 col2
<int> <chr> <int>
1 1 green 3
2 2 yellow 2
3 3 yellow 2
4 4 blue 1
5 5 yellow 2
To group by every three rows:
df %>%
group_by(group = gl(n()/3, 3), col1) %>%
summarise(count = n_distinct(col2))
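Note that the counts above cover col2 only, so ID 4 comes out as 1 rather than the 2 asked for in the question. A minimal sketch that also counts the col1 color, assuming the 15 rows shown in the question and dplyr 1.0 or later (for .groups), is to pass both columns to n_distinct():

```r
library(dplyr)

# The 15 example rows from the question
df <- data.frame(
  ID   = rep(1:5, each = 3),
  col1 = rep(c("green", "yellow", "yellow", "blue", "yellow"), each = 3),
  col2 = c("yellow", "blue", "green",
           "blue", "yellow", "blue",
           "yellow", "yellow", "blue",
           "yellow", "yellow", "yellow",
           "yellow", "blue", "yellow"),
  stringsAsFactors = FALSE
)

out <- df %>%
  group_by(ID, col1) %>%
  # pool the col1 and col2 values of the group before counting distinct colors
  summarise(count = n_distinct(c(col1, col2)), .groups = "drop")

out
```

For ID 4 this pools "blue" from col1 with "yellow" from col2 and therefore counts 2.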
I have a dataframe in a long format and I want to filter pairs based on unique combinations of values. I have a dataset that looks like this:
id <- rep(1:4, each=2)
type <- c("blue", "blue", "red", "yellow", "blue", "red", "red", "yellow")
df <- data.frame(id,type)
df
id type
1 1 blue
2 1 blue
3 2 red
4 2 yellow
5 3 blue
6 3 red
7 4 red
8 4 yellow
Let's say each id is a respondent and type is a combination of treatments. Individual 1 saw two objects, both of them blue; individual 2 saw one red object and a yellow one; and so on.
How do I keep, for example, those that saw the combination "red" and "yellow"? If I filter by the combination "red" and "yellow" the resulting dataset should look like this:
id type
3 2 red
4 2 yellow
7 4 red
8 4 yellow
It should keep respondents number 2 and number 4 (only those that saw the combination "red" and "yellow"). Note that it does not keep respondent number 3 because she saw "blue" and "red" (instead of "red" and "yellow"). How do I do this?
One solution is to reshape the dataset into a wide format, filter it by column, and restack again. But I am sure there is another way to do it without reshaping the dataset. Any idea?
A dplyr solution would be:
library(dplyr)
df <- tibble(  # tibble() replaces the deprecated data_frame()
id = rep(1:4, each = 2),
type = c("blue", "blue", "red", "yellow", "blue", "red", "red", "yellow")
)
types <- c("red", "yellow")
df %>%
group_by(id) %>%
filter(all(types %in% type))
#> # A tibble: 4 x 2
#> # Groups: id [2]
#> id type
#> <int> <chr>
#> 1 2 red
#> 2 2 yellow
#> 3 4 red
#> 4 4 yellow
Update
To allow for repeated values in the combination, e.g. blue, blue, change the filter call to the following:
types2 <- c("blue", "blue")
df %>%
group_by(id) %>%
filter(sum(types2 == type) == length(types2))
#> # A tibble: 2 x 2
#> # Groups: id [1]
#> id type
#> <int> <chr>
#> 1 1 blue
#> 2 1 blue
This solution also works when the types differ:
df %>%
group_by(id) %>%
filter(sum(types == type) == length(types))
#> # A tibble: 4 x 2
#> # Groups: id [2]
#> id type
#> <int> <chr>
#> 1 2 red
#> 2 2 yellow
#> 3 4 red
#> 4 4 yellow
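One caveat: the comparison types == type is element-wise, so it only matches when the rows of an id appear in the same order as the target vector. A hedged, order-independent sketch sorts both sides before comparing them. The helper name matches_exactly is made up for illustration, and the rows of id 4 are deliberately swapped here to show the difference.

```r
library(dplyr)

# Same data as the question, except that id 4 lists yellow before red
df <- data.frame(
  id = rep(1:4, each = 2),
  type = c("blue", "blue", "red", "yellow", "blue", "red", "yellow", "red"),
  stringsAsFactors = FALSE
)

# TRUE when both vectors hold the same values the same number of times
matches_exactly <- function(observed, wanted) {
  identical(sort(observed), sort(wanted))
}

types <- c("red", "yellow")

kept <- df %>%
  group_by(id) %>%
  filter(matches_exactly(type, types)) %>%
  ungroup()

# Repeated values work the same way
kept_blue <- df %>%
  group_by(id) %>%
  filter(matches_exactly(type, c("blue", "blue"))) %>%
  ungroup()
```

Here kept holds ids 2 and 4, whereas the element-wise version would drop id 4 because its rows are in the opposite order.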
Use all() to check that every target value appears within each group.
library(tidyverse)
test_filter <- c("red", "yellow")
df %>%
group_by(id) %>%
filter(all(test_filter %in% type))
# A tibble: 4 x 2
# Groups: id [2]
id type
<int> <fctr>
1 2 red
2 2 yellow
3 4 red
4 4 yellow
I modified your data and did the following.
df <- data.frame(id = rep(1:4, each=3),
type = c("blue", "blue", "green", "red", "yellow", "purple",
"blue", "orange", "yellow", "yellow", "pink", "red"),
stringsAsFactors = FALSE)
id type
1 1 blue
2 1 blue
3 1 green
4 2 red
5 2 yellow
6 2 purple
7 3 blue
8 3 orange
9 3 yellow
10 4 yellow
11 4 pink
12 4 red
As you see, there are three observations for each id. Ids 2 and 4 have both red and yellow, but they also have non-target colors (purple and pink), and I wanted to keep those observations too. The following code can be read like this: "For each id, check whether there is any red and any yellow using any(). When both conditions are TRUE, keep all rows for that id."
group_by(df, id) %>%
filter(any(type == "yellow") & any(type == "red"))
id type
4 2 red
5 2 yellow
6 2 purple
10 4 yellow
11 4 pink
12 4 red
Using data.table:
library(data.table)
setDT(df)
df[, type1 := shift(type, type = "lag"), by = id]
df1 <- df[type == "yellow" & type1 == "red", id]
df <- df[id %in% df1, ]
df[, type1 := NULL]
It gives:
id type
1: 2 red
2: 2 yellow
3: 4 red
4: 4 yellow
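The shift() version relies on "red" sitting directly before "yellow" within an id. As a sketch only, a data.table filter that does not depend on row order can return .SD for the groups that contain every wanted value; the names dt, wanted and kept are illustrative.

```r
library(data.table)

dt <- data.table(
  id = rep(1:4, each = 2),
  type = c("blue", "blue", "red", "yellow", "blue", "red", "red", "yellow")
)

wanted <- c("red", "yellow")

# For each id, return its rows (.SD) only if every wanted value is present;
# groups where the condition is FALSE return NULL and are dropped
kept <- dt[, if (all(wanted %in% type)) .SD, by = id]

kept
```

Like the any()/all() answers, this also keeps ids that have extra colors in addition to the wanted ones.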
I am trying to find the most and least frequent items across two columns within each group of a larger data frame. Here is some data to make it clearer:
df <- data.frame(matrix(nrow = 8, ncol = 3))
df$X1 <- c(1, 1, 1, 2, 2, 3, 3, 3)
df$X2 <- c("yellow", "green", "yellow", "blue", NA, "orange", NA, "orange")
df$X3 <- c("green", "yellow", NA, "blue", "red", "purple" , "orange", NA)
names(df) <- c("group", "A", "B")
Here is what that looks like (I have NAs in the original data, so I've included them):
group A B
1 1 yellow green
2 1 green yellow
3 1 yellow <NA>
4 2 blue blue
5 2 <NA> red
6 3 orange purple
7 3 <NA> orange
8 3 orange <NA>
In the first "group", for instance, I want to determine which color occurs the most and which color occurs the least. Something that looks like this:
group A B most least
1 1 yellow green yellow green
2 1 green yellow yellow green
3 1 yellow <NA> yellow green
4 2 blue blue blue red
5 2 <NA> red blue red
6 3 orange purple orange purple
7 3 <NA> orange orange purple
8 3 orange <NA> orange purple
I am working within a dplyr chain in the original data so I can group_by "group", but I am having a hard time figuring out a method that allows me to work within a "cluster" of two columns with differing numbers of rows. I do not need this to be done with dplyr, but I figured it might be easiest given the usefulness of group_by. Additionally, I need the result to somehow remain in the original data frame as new columns. Any suggestions?
Here is a solution using dplyr and tidyr. The strategy is to find the "most" and "least" frequent item for each group in a new data frame, then use right_join to merge it with the original data frame and produce the desired output.
Note that I used slice to pick the most and least frequent items, which guarantees exactly one "most" and one "least" per group. Ties within a group are possible, though; if that happens, you may want to think about a good rule for deciding which item counts as the "most" or the "least".
library(dplyr)
library(tidyr)
df2 <- df %>%
gather(Column, Value, -group, na.rm = TRUE) %>%
count(group, Value) %>%
arrange(group, desc(n)) %>%
group_by(group) %>%
slice(c(1, n())) %>%
mutate(Type = c("most", "least")) %>%
select(-n) %>%
spread(Type, Value) %>%
right_join(df, by = "group") %>%
select(c(colnames(df), "most", "least"))
df2
# A tibble: 8 x 5
group A B most least
<dbl> <chr> <chr> <chr> <chr>
1 1 yellow green yellow green
2 1 green yellow yellow green
3 1 yellow <NA> yellow green
4 2 blue blue blue red
5 2 <NA> red blue red
6 3 orange purple orange purple
7 3 <NA> orange orange purple
8 3 orange <NA> orange purple
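On the question of ties: one possible rule, sketched here under the assumption that listing every tied color is acceptable, is to keep all values whose count equals the group maximum (or minimum) and join them into one string. It uses pivot_longer() from tidyr 1.0 or later instead of gather(); the name ties is made up for illustration.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  group = c(1, 1, 1, 2, 2, 3, 3, 3),
  A = c("yellow", "green", "yellow", "blue", NA, "orange", NA, "orange"),
  B = c("green", "yellow", NA, "blue", "red", "purple", "orange", NA),
  stringsAsFactors = FALSE
)

ties <- df %>%
  pivot_longer(c(A, B), names_to = "Column", values_to = "Value",
               values_drop_na = TRUE) %>%
  count(group, Value) %>%
  group_by(group) %>%
  summarise(
    most  = toString(Value[n == max(n)]),  # every value tied for the top count
    least = toString(Value[n == min(n)]),  # every value tied for the bottom count
    .groups = "drop"
  )

result <- left_join(df, ties, by = "group")
```

The sample data has no ties, so this gives the same most and least columns as above; with a tie, the cell would hold a comma-separated list such as "green, yellow".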
Two options:
Reshape to long form and use summarise (or count) to aggregate, subsetting with which.max/which.min:
library(tidyverse)
df <- tibble(group = c(1, 1, 1, 2, 2, 3, 3, 3),
A = c("yellow", "green", "yellow", "blue", NA, "orange", NA, "orange"),
B = c("green", "yellow", NA, "blue", "red", "purple" , "orange", NA))
df %>%
gather(var, color, A:B) %>%
drop_na(color) %>%
group_by(group, color) %>%
summarise(n = n()) %>%
summarise(most = color[which.max(n)],
least = color[which.min(n)]) %>%
left_join(df, .)
#> Joining, by = "group"
#> # A tibble: 8 x 5
#> group A B most least
#> <dbl> <chr> <chr> <chr> <chr>
#> 1 1 yellow green yellow green
#> 2 1 green yellow yellow green
#> 3 1 yellow <NA> yellow green
#> 4 2 blue blue blue red
#> 5 2 <NA> red blue red
#> 6 3 orange purple orange purple
#> 7 3 <NA> orange orange purple
#> 8 3 orange <NA> orange purple
Sort a table of values and subset it:
df %>%
group_by(group) %>%
mutate(most = last(names(sort(table(c(A, B))))),
least = first(names(sort(table(c(A, B))))))
#> # A tibble: 8 x 5
#> # Groups: group [3]
#> group A B most least
#> <dbl> <chr> <chr> <chr> <chr>
#> 1 1 yellow green yellow green
#> 2 1 green yellow yellow green
#> 3 1 yellow <NA> yellow green
#> 4 2 blue blue blue red
#> 5 2 <NA> red blue red
#> 6 3 orange purple orange purple
#> 7 3 <NA> orange orange purple
#> 8 3 orange <NA> orange purple
I have a data frame such as data
data = data.frame(ID = as.factor(c("A", "A", "B","B","C","C")),
var.color= as.factor(c("red", "blue", "green", "red", "green", "yellow")))
I wonder whether it is possible to get the levels of each group in ID (e.g. A, B, C) and create a variable that pastes them. I have attempted to do so by running the following:
data %>% group_by(ID) %>%
mutate(ex = paste(droplevels(var.color), sep = "_"))
That yields:
Source: local data frame [6 x 3]
Groups: ID [3]
ID var.color ex
<fctr> <fctr> <chr>
1 A red red
2 A blue blue
3 B green green
4 B red red
5 C green green
6 C yellow yellow
However, my desired data.frame should be something like:
ID var.color ex
<fctr> <fctr> <chr>
1 A red red_blue
2 A blue red_blue
3 B green green_red
4 B red green_red
5 C green green_yellow
6 C yellow green_yellow
Basically, you need collapse instead of sep. Instead of dropping levels, you can just paste the text together, grouped by ID:
library(dplyr)
data %>% group_by(ID) %>%
mutate(ex = paste(var.color, collapse = "_"))
# ID var.color ex
# <fctr> <fctr> <chr>
#1 A red red_blue
#2 A blue red_blue
#3 B green green_red
#4 B red green_red
#5 C green green_yellow
#6 C yellow green_yellow
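The difference between the two arguments can be seen in base R alone. In this small sketch (variable names are illustrative), sep only separates the different arguments passed to paste(), while collapse joins the elements of the resulting vector into a single string:

```r
colors <- c("red", "blue")

# sep joins across arguments; with a single argument nothing is joined
with_sep <- paste(colors, sep = "_")

# collapse joins the elements of the result into one string
with_collapse <- paste(colors, collapse = "_")

# Both together: pair up the arguments with sep, then collapse the pairs
both <- paste(c("A", "B"), colors, sep = "-", collapse = "_")

with_sep       # "red" "blue"
with_collapse  # "red_blue"
both           # "A-red_B-blue"
```

This is why the original attempt with sep = "_" returned each color unchanged.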
You can do the same by using loops
for(i in unique(data$ID)){
data$ex[data$ID==i] <- paste0(data$var.color[data$ID==i], collapse = "_")
}
> data
ID var.color ex
1 A red red_blue
2 A blue red_blue
3 B green green_red
4 B red green_red
5 C green green_yellow
6 C yellow green_yellow