Count observations by group based on conditions of 2 variables in R

This is a short example of the data. The original data has many more columns and rows.
head(df, 15)
   ID   col1   col2
1   1  green yellow
2   1  green   blue
3   1  green  green
4   2 yellow   blue
5   2 yellow yellow
6   2 yellow   blue
7   3 yellow yellow
8   3 yellow yellow
9   3 yellow   blue
10  4   blue yellow
11  4   blue yellow
12  4   blue yellow
13  5 yellow yellow
14  5 yellow   blue
15  5 yellow yellow
I want to count how many different colors appear in col2 for each ID, including the color in col1. For example, for ID = 4 there is only 1 color in col2; if we include col1, there are 2 different colors, so the output should be 2, and so on.
I tried it this way, but it doesn't give my desired output: ID = 4 turns into 0, which is not what I want. How can I tell R to count the colors including the one in col1?
out <- df %>%
group_by(ID) %>%
mutate(N = ifelse(col1 != col2, 1, 0))
My desired output is something like this:
ID col1 count
1 green 3
2 yellow 2
3 yellow 2
4 blue 2
5 yellow 2

You can do:
df %>%
group_by(ID, col1) %>%
summarise(count = n_distinct(col2))
ID col1 count
<int> <chr> <int>
1 1 green 3
2 2 yellow 2
3 3 yellow 2
4 4 blue 1
5 5 yellow 2
Or even:
df %>%
group_by(ID, col1) %>%
summarise_all(n_distinct)
ID col1 col2
<int> <chr> <int>
1 1 green 3
2 2 yellow 2
3 3 yellow 2
4 4 blue 1
5 5 yellow 2
To group by every three rows:
df %>%
group_by(group = gl(n()/3, 3), col1) %>%
summarise(count = n_distinct(col2))
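Note that the outputs above give 1 for ID 4, while the desired output in the question is 2, because only col2 is counted. A minimal sketch that also counts the color in col1 could look like this. It assumes the example data from the question, rebuilt here so the snippet runs on its own; the name out is only illustrative.

```r
library(dplyr)

# Rebuild the example data from the question.
df <- data.frame(
  ID   = rep(1:5, each = 3),
  col1 = rep(c("green", "yellow", "yellow", "blue", "yellow"), each = 3),
  col2 = c("yellow", "blue", "green",
           "blue", "yellow", "blue",
           "yellow", "yellow", "blue",
           "yellow", "yellow", "yellow",
           "yellow", "blue", "yellow")
)

# Count distinct colors over col1 and col2 together within each ID.
out <- df %>%
  group_by(ID, col1) %>%
  summarise(count = n_distinct(c(col1, col2)), .groups = "drop")

out
```

With this data, ID 4 has only yellow in col2 and blue in col1, so its count becomes 2.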

Related

R Counting Occurrences in a Column of Data Frame, Grouped by Another Column

I basically have a data frame with a column of letters and a column of colors:
x <- data.frame(col1=c("a","b","a","c","d","d","c","a","b","c"),
col2=c("red","orange","yellow","red","red","yellow","orange","yellow","red","orange"))
col1 col2
a red
b orange
a yellow
c red
d red
d yellow
c orange
a yellow
b red
c orange
My goal is to create a second data frame that counts the number of occurrences of each color in col2 of x for each letter in col1. Basically:
Letters Occurences Red Orange Yellow
      a          3   1      0      2
      b          2   1      1      0
      c          3   1      2      0
      d          2   1      0      1
Right now, I just brute forced it since there are only 3 factors of col2. I used:
df <- data.frame(Letters = levels(factor(x$col1)))
df$Occurences <- table(x$col1)
df$red <- table(factor(x$col1[x$col2=="red"],levels=levels(factor(x$col1))))
df$orange <- table(factor(x$col1[x$col2=="orange"],levels=levels(factor(x$col1))))
df$yellow <- table(factor(x$col1[x$col2=="yellow"],levels=levels(factor(x$col1))))
Is there an easier way to do this, as opposed to doing each column of df one by one? Especially with a data set that has a lot more than 3 factors?
Use pivot_wider from tidyr
library(tidyr)
x %>%
pivot_wider(names_from = col2, values_from = col2, values_fn = "length", values_fill = 0)
Output:
# A tibble: 4 × 4
col1 red orange yellow
<chr> <int> <int> <int>
1 a 1 0 2
2 b 1 1 0
3 c 1 2 0
4 d 1 0 1
Or, in base R, using table and addmargins (the Sum column is the total per letter):
as.data.frame.matrix(addmargins(table(x), 2))
  orange red yellow Sum
a      0   1      2   3
b      1   1      0   2
c      2   1      0   3
d      0   1      1   2
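If the per-letter total should sit next to the color counts in a tidyverse pipeline, one possible sketch is to add the total with add_count before counting each letter/color pair. The column name Occurences follows the spelling in the question, and res is only an illustrative name.

```r
library(dplyr)
library(tidyr)

# Example data from the question.
x <- data.frame(
  col1 = c("a", "b", "a", "c", "d", "d", "c", "a", "b", "c"),
  col2 = c("red", "orange", "yellow", "red", "red",
           "yellow", "orange", "yellow", "red", "orange")
)

res <- x %>%
  add_count(col1, name = "Occurences") %>%  # total rows per letter
  count(col1, Occurences, col2) %>%         # rows per letter/color pair
  pivot_wider(names_from = col2, values_from = n, values_fill = 0)

res
```

Because the colors come from the data, this should scale to more than three colors without listing them by hand.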

Bind tables only where a specific value is missing, using R

Is it possible to bind rows only where specific values are missing?
In this example I have a table with four IDs and some values. Every ID is supposed to have a corresponding value of 1-3. As you can see, some of these values are missing in table dat. To fix this I want to bind dat with dat2, but only where values from the column "value" are missing from table dat. How can I achieve this?
To be clear, I only want 12 rows in total. So for instance, ID 4 has the value 3 and cat_var "green" in table dat. By contrast, in table dat2 ID 4 has the value 3 and cat_var "red". This means that I don't want to bind that row, since there already exists a row for ID 4 and value 3 in table dat. I hope I'm making myself clear.
library(tidyverse)
Data:
id <- c(rep(1:4,3))
value <- c(rep(1:3, each = 4))
dat <- data.frame(id, value)
dat2 <- dat
dat <- dat %>%
slice(1, 3, 5, 6, 7, 8, 10, 12)
dat2$cat_var <- c(rep("orange", 5), rep("green", 5), rep("red", 2))
dat$cat_var <- c(rep("orange", 3), rep("green", 5))
Desired result:
# A tibble: 12 x 3
id value cat_var
<int> <int> <chr>
1 1 1 orange
2 2 1 orange
3 3 1 orange
4 4 1 orange
5 1 2 orange
6 2 2 green
7 3 2 green
8 4 2 green
9 1 3 green
10 2 3 green
11 3 3 red
12 4 3 green
dat %>% bind_rows(dat2) %>% distinct(id, value, .keep_all = T) %>%
arrange(value, id)
This results in:
id value cat_var
1 1 1 orange
2 2 1 orange
3 3 1 orange
4 4 1 orange
5 1 2 orange
6 2 2 green
7 3 2 green
8 4 2 green
9 1 3 green
10 2 3 green
11 3 3 red
12 4 3 green
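You don't need the arrange (it is only there to get exactly the same data frame as the desired result).
Using base R:
Row-bind dat and dat2, then use duplicated to keep only the first row for each id/value pair. Because dat comes first, its rows win over those of dat2.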
result <- rbind(dat, dat2)
result <- result[!duplicated(result[, c('id', 'value')]), ]
result
# id value cat_var
#1 1 1 orange
#2 3 1 orange
#3 1 2 orange
#4 2 2 green
#5 3 2 green
#6 4 2 green
#7 2 3 green
#8 4 3 green
#10 2 1 orange
#12 4 1 orange
#17 1 3 green
#19 3 3 red
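Another way to express "bind only the rows that are missing" is an anti-join. The following is a sketch under the assumption that id and value together identify a row; missing_rows and res are illustrative names, and the data is rebuilt from the question.

```r
library(dplyr)

# Rebuild dat and dat2 as in the question.
id <- rep(1:4, 3)
value <- rep(1:3, each = 4)
dat2 <- data.frame(id, value)
dat <- dat2 %>% slice(1, 3, 5, 6, 7, 8, 10, 12)
dat2$cat_var <- c(rep("orange", 5), rep("green", 5), rep("red", 2))
dat$cat_var <- c(rep("orange", 3), rep("green", 5))

# Rows of dat2 whose id/value pair does not occur in dat.
missing_rows <- anti_join(dat2, dat, by = c("id", "value"))

# Append only those rows, then sort to match the desired result.
res <- bind_rows(dat, missing_rows) %>%
  arrange(value, id)

res
```

Unlike the distinct or duplicated approaches, this never relies on row order to decide which table wins.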

How to group and summarise each data frame in a list of data frames

I have a list of data frames:
df1 <- data.frame(one = c('red','blue','green','red','red','blue','green','green'),
one.1 = as.numeric(c('1','1','0','1','1','0','0','0')))
df2 <- data.frame(two = c('red','yellow','green','yellow','green','blue','blue','red'),
two.2 = as.numeric(c('0','1','1','0','0','0','1','1')))
df3 <- data.frame(three = c('yellow','yellow','green','green','green','white','blue','white'),
three.3 = as.numeric(c('1','0','0','1','1','0','0','1')))
all <- list(df1,df2,df3)
I need to group each data frame by the first column and summarise the second column.
Individually I would do something like this:
library(dplyr)
df1 <- df1 %>%
group_by(one) %>%
summarise(sum = sum(one.1))
However I'm having trouble figuring out how to iterate over each item in the list.
I've thought of using a loop:
for(i in 1:3){
all[i] <- all[i] %>%
group_by_at(1) %>%
summarise()
}
But I can't figure out how to specify a column to sum in the summarise() function (this loop is likely wrong in other ways than that anyway).
Ideally I need the output to be another list with each item being the summarised data, like so:
[[1]]
# A tibble: 3 x 2
one sum
<fct> <dbl>
1 blue 1
2 green 0
3 red 3
[[2]]
# A tibble: 4 x 2
two sum
<fct> <dbl>
1 blue 1
2 green 1
3 red 1
4 yellow 1
[[3]]
# A tibble: 4 x 2
three sum
<fct> <dbl>
1 blue 0
2 green 2
3 white 1
4 yellow 1
Would really appreciate any help!
Use purrr::map, and apply summarise_at to the columns whose names contain a literal dot (\\.), selected with the matches helper.
library(dplyr)
library(purrr)
map(all, ~.x %>%
#group_by_at(vars(matches('one$|two$|three$'))) %>% #column ends with one, two, or three
group_by_at(1) %>%
summarise_at(vars(matches('\\.')),sum))
#summarise_at(vars(matches('\\.')),list(sum=~sum))) #2nd option
[[1]]
# A tibble: 3 x 2
one one.1
<fct> <dbl>
1 blue 1
2 green 0
3 red 3
[[2]]
# A tibble: 4 x 2
two two.2
<fct> <dbl>
1 blue 1
2 green 1
3 red 1
4 yellow 1
[[3]]
# A tibble: 4 x 2
three three.3
<fct> <dbl>
1 blue 0
2 green 2
3 white 1
4 yellow 1
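group_by_at and summarise_at are superseded in dplyr 1.0 and later in favour of across. A possible sketch of the same idea that selects columns by position and names the result sum, as in the desired output, is shown below. The helper summarise_one and the variable result are hypothetical names, and the sketch assumes the first column is always the group and the second the value.

```r
library(dplyr)

# Example list of data frames from the question.
df1 <- data.frame(one = c("red", "blue", "green", "red", "red", "blue", "green", "green"),
                  one.1 = c(1, 1, 0, 1, 1, 0, 0, 0))
df2 <- data.frame(two = c("red", "yellow", "green", "yellow", "green", "blue", "blue", "red"),
                  two.2 = c(0, 1, 1, 0, 0, 0, 1, 1))
df3 <- data.frame(three = c("yellow", "yellow", "green", "green", "green", "white", "blue", "white"),
                  three.3 = c(1, 0, 0, 1, 1, 0, 0, 1))
all <- list(df1, df2, df3)

# Hypothetical helper: group by the first column, sum the second.
summarise_one <- function(d) {
  group_col <- names(d)[1]
  value_col <- names(d)[2]
  d %>%
    group_by(across(all_of(group_col))) %>%
    summarise(sum = sum(.data[[value_col]]))
}

result <- lapply(all, summarise_one)
result
```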
Here's a base R solution:
lapply(all, function(DF) aggregate(list(added = DF[, 2]), by = DF[, 1, drop = F], FUN = sum))
[[1]]
one added
1 blue 1
2 green 0
3 red 3
[[2]]
two added
1 blue 1
2 green 1
3 red 1
4 yellow 1
[[3]]
three added
1 blue 0
2 green 2
3 white 1
4 yellow 1
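Another approach would be to bind the data frames in the list into one. Here I use data.table and bind by position rather than by column name, so the combined columns take the names of the first data frame. The only problem is that this may mess up factors, but I'm not sure that's an issue in your case.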
library(data.table)
rbindlist(all, use.names = F, idcol = 'id'
)[, .(added = sum(one.1)), by = .(id, color = one)]
id color added
1: 1 red 3
2: 1 blue 1
3: 1 green 0
4: 2 red 1
5: 2 yellow 1
6: 2 green 1
7: 2 blue 1
8: 3 yellow 1
9: 3 green 2
10: 3 white 1
11: 3 blue 0

tidyverse grouping combining small groups into "other"

Let's say I want to summarize a certain data frame column:
> starwars %>% count(eye_color)
# A tibble: 15 x 2
eye_color n
<chr> <int>
1 black 10
2 blue 19
3 blue-gray 1
4 brown 21
5 dark 1
6 gold 1
7 green, yellow 1
8 hazel 3
9 orange 8
10 pink 1
11 red 5
12 red, blue 1
13 unknown 3
14 white 1
15 yellow 11
There are a lot of small categories, such as "blue-gray" or "pink". I would like to merge them all into "other". There is a multi-step process to do this:
starwars %>%
add_count(eye_color) %>%
mutate(eye_color = if_else(n < 5, "other", eye_color)) %>%
count(eye_color)
There is also a way to do it with a single command. I saw this trick before somewhere, but now cannot find it.
Writing up @Jordan's suggestion, updated with Camille's fix:
starwars %>% mutate(eye_color_grp = as.factor(eye_color) %>%
forcats::fct_lump_min(min = 5, other_level = "Other")) %>%
count(eye_color_grp, sort = TRUE)
Link: https://forcats.tidyverse.org/reference/fct_lump.html
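To see what fct_lump_min does in isolation, here is a small sketch on a made-up vector (the values in eye are hypothetical and are not the starwars data). Levels that occur fewer than min times are merged into the other level.

```r
library(forcats)

# Hypothetical toy vector: two frequent colors and three rare ones.
eye <- c(rep("brown", 6), rep("blue", 5), "pink", "gold", "hazel", "hazel")

# Levels with fewer than 5 occurrences are lumped into "other".
lumped <- fct_lump_min(eye, min = 5, other_level = "other")

table(lumped)
```

Here pink, gold and hazel each occur fewer than 5 times, so their 4 observations end up under other.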

Select the n most frequent values in a variable

I would like to find the most common values in a column in a data frame. I assume using table would be the best way to do this? I then want to filter/subset my data frame to only include these top-n values.
An example of my data frame is as follows. Here I want to find e.g. the top 2 IDs.
ID col
A blue
A purple
A green
B green
B red
C red
C blue
C yellow
C orange
I therefore want to output the following:
Top 2 values of ID are:
A and C
I will then select the rows corresponding to ID A and C:
ID col
A blue
A purple
A green
C red
C blue
C yellow
C orange
You can try a tidyverse approach. Add the count for each ID, then filter for the top two (using < 3) or top ten (using < 11):
library(tidyverse)
d %>%
add_count(ID) %>%
filter(dense_rank(-n) < 3)
# A tibble: 7 x 3
ID col n
<fct> <fct> <int>
1 A blue 3
2 A purple 3
3 A green 3
4 C red 4
5 C blue 4
6 C yellow 4
7 C orange 4
Data
d <- read.table(text="ID col
A blue
A purple
A green
B green
B red
C red
C blue
C yellow
C orange", header=T)
We can count the number of values using table, sort them in decreasing order and select first 2 (or 10) values, get the corresponding ID's and subset those ID's from the data frame.
d[d$ID %in% names(sort(table(d$ID), decreasing = TRUE)[1:2]), ]
# ID col
#1 A blue
#2 A purple
#3 A green
#6 C red
#7 C blue
#8 C yellow
#9 C orange
With the tidyverse and its top_n:
library(tidyverse)
d %>%
group_by(ID) %>%
summarise(n()) %>%
top_n(2)
Selecting by n()
# A tibble: 2 x 2
ID `n()`
<fct> <int>
1 A 3
2 C 4
To finish with the subset:
d %>%
group_by(ID) %>%
summarise(n()) %>%
top_n(2) %>%
{ filter(d, ID %in% .$ID) }
Selecting by n()
ID col
1 A blue
2 A purple
3 A green
4 C red
5 C blue
6 C yellow
7 C orange
(We use the braces because we don't want the left-hand side result to be passed as the first argument of filter.)
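top_n is superseded in recent dplyr versions by slice_max. A sketch of the same two steps with slice_max and a semi-join, which avoids the braces, might look like this. The names top_ids and res are illustrative, and the data is the d defined earlier.

```r
library(dplyr)

d <- read.table(text = "ID col
A blue
A purple
A green
B green
B red
C red
C blue
C yellow
C orange", header = TRUE)

# The two most frequent IDs; with ties, slice_max may return more than two.
top_ids <- d %>%
  count(ID) %>%
  slice_max(n, n = 2)

# Keep only the rows of d whose ID is among the top IDs.
res <- d %>%
  semi_join(top_ids, by = "ID")

res
```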
