I would like to find the most common values in a column in a data frame. I assume using table would be the best way to do this? I then want to filter/subset my data frame to only include these top-n values.
An example of my data frame is as follows. Here I want to find e.g. the top 2 IDs.
ID col
A blue
A purple
A green
B green
B red
C red
C blue
C yellow
C orange
I therefore want to output the following:
Top 2 values of ID are:
A and C
I will then select the rows corresponding to ID A and C:
ID col
A blue
A purple
A green
C red
C blue
C yellow
C orange
You can try the tidyverse: add the count of each ID, then filter for the top two (using < 3) or the top ten (using < 11):
library(tidyverse)
d %>%
add_count(ID) %>%
filter(dense_rank(-n) < 3)
# A tibble: 7 x 3
ID col n
<fct> <fct> <int>
1 A blue 3
2 A purple 3
3 A green 3
4 C red 4
5 C blue 4
6 C yellow 4
7 C orange 4
Data
d <- read.table(text="ID col
A blue
A purple
A green
B green
B red
C red
C blue
C yellow
C orange", header=T)
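To see why `dense_rank(-n) < 3` keeps the two most frequent IDs, it can help to rank the per-ID counts directly. This is only a small sketch, assuming dplyr is installed; the vector `counts` is a hypothetical stand-in for the frequencies of A, B and C in the example data.

```r
library(dplyr)

# Hypothetical per-ID counts taken from the example: A = 3, B = 2, C = 4
counts <- c(A = 3, B = 2, C = 4)

# Negating the counts makes the largest count receive rank 1
ranks <- dense_rank(-counts)
print(ranks)

# IDs whose rank is below 3, i.e. those with the two highest distinct counts
top_ids <- names(counts)[ranks < 3]
print(top_ids)
```

Because dense_rank gives tied counts the same rank, an ID tied with the second-highest count would also be kept, so the result can contain more than two IDs.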
We can count the values with table, sort the counts in decreasing order, select the first 2 (or 10), take the corresponding IDs, and subset the data frame to those IDs (the data frame is called df here).
df[df$ID %in% names(sort(table(df$ID), decreasing = TRUE)[1:2]), ]
# ID col
#1 A blue
#2 A purple
#3 A green
#6 C red
#7 C blue
#8 C yellow
#9 C orange
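The one-liner above can be unpacked into named steps, which may be easier to read and debug. This is a sketch in base R; the names `counts`, `top_ids` and `result` are illustrative only, and the data frame is rebuilt here as `d` so the block runs on its own.

```r
# Example data from the question (named d here; the answer above calls it df)
d <- data.frame(
  ID  = c("A", "A", "A", "B", "B", "C", "C", "C", "C"),
  col = c("blue", "purple", "green", "green", "red",
          "red", "blue", "yellow", "orange")
)

counts  <- sort(table(d$ID), decreasing = TRUE)  # frequency of each ID, largest first
top_ids <- names(head(counts, 2))                # names of the two most frequent IDs
result  <- d[d$ID %in% top_ids, ]                # keep only the rows with those IDs

print(top_ids)
print(result)
```

Note that taking the first two entries cuts ties off at exactly two IDs, unlike the dense_rank approach, which keeps all tied IDs.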
With the tidyverse and its top_n() (superseded by slice_max() since dplyr 1.0.0, but still available):
library(tidyverse)
d %>%
group_by(ID) %>%
summarise(n()) %>%
top_n(2)
Selecting by n()
# A tibble: 2 x 2
ID `n()`
<fct> <int>
1 A 3
2 C 4
To complete this with the subset:
d %>%
group_by(ID) %>%
summarise(n()) %>%
top_n(2) %>%
{ filter(d, ID %in% .$ID) }
Selecting by n()
ID col
1 A blue
2 A purple
3 A green
4 C red
5 C blue
6 C yellow
7 C orange
(We use the braces because the left-hand side result should not be passed as the first argument of filter; inside the braces it is available as the dot, .)
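A variant without the braces is sketched below, assuming dplyr 1.0.0 or later, where slice_max() is available. The intermediate name `top_ids` is hypothetical; the idea is to pull the top IDs out first and then filter the original data frame.

```r
library(dplyr)

d <- data.frame(
  ID  = c("A", "A", "A", "B", "B", "C", "C", "C", "C"),
  col = c("blue", "purple", "green", "green", "red",
          "red", "blue", "yellow", "orange")
)

top_ids <- d %>%
  count(ID) %>%              # one row per ID, with its frequency in column n
  slice_max(n, n = 2) %>%    # rows with the two largest counts (ties are kept)
  pull(ID)                   # extract the ID column as a vector

result <- d %>% filter(ID %in% top_ids)
print(result)
```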
Background
I've got this dataframe df:
df <- data.frame(ID = c("a","a","a","b", "c","c","c","c"),
event = c("red","black","blue","white", "orange","red","gray","green"),
stringsAsFactors=FALSE)
It's got some people in it (ID) and a description of an event. I'd like to make a new variable condition that indicates 1 or 0 based on whether any of the cells for a given ID contain either "red" or "blue".
The Problem
I can get this to work, but only for the matching row. What I'd like is that if any of a person's cells in event contain "red" or "blue", all their cells in condition should be marked 1. In other words, I'd like this:
ID event condition
a red 1
a black 1
a blue 1
b white 0
c orange 1
c red 1
c gray 1
c green 1
What I've tried
So far, I've used this code to get this result:
df <- df %>%
mutate(condition = ifelse(df$event %in% c("red","blue"), 1, 0))
ID event condition
a red 1
a black 0
a blue 1
b white 0
c orange 0
c red 1
c gray 0
c green 0
In other words, the rows that match are marked 1, but I'd like all rows for an ID with any matching row to be marked 1.
We need any() wrapped around the logical vector from %in%, so that each ID gets a single TRUE or FALSE; the arguments of %in% can also be reversed. (The OP's code returns 1 only in the rows that match 'red' or 'blue' and leaves the others 0.) The unary + converts the logical result to an integer.
library(dplyr)
df %>%
group_by(ID) %>%
mutate(condition = +(any(c('red', 'blue') %in% event))) %>%
ungroup
-output
# A tibble: 8 × 3
ID event condition
<chr> <chr> <int>
1 a red 1
2 a black 1
3 a blue 1
4 b white 0
5 c orange 1
6 c red 1
7 c gray 1
8 c green 1
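The same "any match within the ID" logic can also be sketched in base R, without grouping. This assumes the df from the question; `hit_ids` is a hypothetical helper name.

```r
df <- data.frame(ID = c("a", "a", "a", "b", "c", "c", "c", "c"),
                 event = c("red", "black", "blue", "white",
                           "orange", "red", "gray", "green"),
                 stringsAsFactors = FALSE)

# IDs that have at least one matching event
hit_ids <- unique(df$ID[df$event %in% c("red", "blue")])

# Flag every row that belongs to one of those IDs
df$condition <- as.integer(df$ID %in% hit_ids)
print(df)
```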
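Here is an alternative approach using str_detect. Note that any() is needed so that the whole group is flagged, and that the regular expression matches substrings (for example, it would also match "darkred"):
library(dplyr)
library(stringr)
df %>%
group_by(ID) %>%
mutate(condition = if_else(any(str_detect(event, paste(c("red", "blue"), collapse = "|"))), 1, 0))
ID event condition
<chr> <chr> <dbl>
1 a red 1
2 a black 1
3 a blue 1
4 b white 0
5 c orange 1
6 c red 1
7 c gray 1
8 c green 1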
I have a data set that looks like this:
and I would like to get a summary data set that looks like this:
What should I do? Thanks. The sample data can be built with the following code:
ID<- c("1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18")
Group<-c("A","B","C","D","D","D","A","B","D","C","B","D","A","A","C","B","B","B")
Color<-c("Green","Yellow","Red","Red","Red","Yellow","Green","Green","Yellow","Red","Red","Yellow","Yellow","Yellow","Green","Red","Red","Green")
Realy_Love<-c("Y","N","Y","Y","N","N","Y","Y","Y","N","N","Y","N","Y","N","Y","N","Y")
Sample.data <- data.frame(ID, Group, Color, Realy_Love)
You can use dplyr and group by the following items:
Sample.data %>%
group_by(Group, Color, Realy_Love) %>%
summarise(Obs = n())
# Group Color Realy_Love Obs
# <chr> <chr> <chr> <int>
# 1 A Green Y 2
# 2 A Yellow N 1
# 3 A Yellow Y 1
# 4 B Green Y 2
# 5 B Red N 2
# 6 B Red Y 1
# 7 B Yellow N 1
# 8 C Green N 1
# 9 C Red N 1
# 10 C Red Y 1
# 11 D Red N 1
# 12 D Red Y 1
# 13 D Yellow N 1
# 14 D Yellow Y 2
Use dplyr from the tidyverse to get a summary. You can then use arrange() to sort by Color or another variable.
Sample.data %>%
group_by(Group, Color, Realy_Love) %>%
summarise(Obs = n()) %>%
arrange(Color)
With dplyr, you don't even need to group the columns; the count() function does it in one step:
Sample.data %>%
count(Group, Color, Realy_Love, sort = TRUE)
The optional sort = TRUE argument sorts the result in descending order of frequency:
Group Color Realy_Love n
1 A Green Y 2
2 B Green Y 2
3 B Red N 2
4 D Yellow Y 2
5 A Yellow N 1
6 A Yellow Y 1
7 B Red Y 1
8 B Yellow N 1
9 C Green N 1
10 C Red N 1
11 C Red Y 1
12 D Red N 1
13 D Red Y 1
14 D Yellow N 1
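As a quick check that count() and the group_by() plus summarise() version agree, the sketch below computes both and names the count column Obs in each. It assumes dplyr 1.0.0 or later (for the `.groups` argument); `by_count` and `by_group` are hypothetical names.

```r
library(dplyr)

ID <- as.character(1:18)
Group <- c("A","B","C","D","D","D","A","B","D","C","B","D","A","A","C","B","B","B")
Color <- c("Green","Yellow","Red","Red","Red","Yellow","Green","Green","Yellow",
           "Red","Red","Yellow","Yellow","Yellow","Green","Red","Red","Green")
Realy_Love <- c("Y","N","Y","Y","N","N","Y","Y","Y","N","N","Y","N","Y","N","Y","N","Y")
Sample.data <- data.frame(ID, Group, Color, Realy_Love)

# One-step version; name = "Obs" renames the default count column n
by_count <- Sample.data %>%
  count(Group, Color, Realy_Love, name = "Obs")

# Two-step version; .groups = "drop" returns an ungrouped result
by_group <- Sample.data %>%
  group_by(Group, Color, Realy_Love) %>%
  summarise(Obs = n(), .groups = "drop")

print(by_count)
```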
This is a short example of the data. The original data has many columns and rows.
head(df, 15)
ID col1 col2
1 1 green yellow
2 1 green blue
3 1 green green
4 2 yellow blue
5 2 yellow yellow
6 2 yellow blue
7 3 yellow yellow
8 3 yellow yellow
9 3 yellow blue
10 4 blue yellow
11 4 blue yellow
12 4 blue yellow
13 5 yellow yellow
14 5 yellow blue
15 5 yellow yellow
I want to count how many different colors there are in col2, including the color in col1. For example, for ID = 4 there is only 1 color in col2; if we include col1, there are 2 different colors, so the output should be 2, and so on.
I tried it this way, but it doesn't give me my desired output: ID = 4 turns into 0, which is not what I want. So how can I tell R to count the colors including the one in col1?
out <- df %>%
group_by(ID) %>%
mutate(N = ifelse(col1 != col2, 1, 0))
My desired output is something like this:
ID col1 count
1 green 3
2 yellow 2
3 yellow 2
4 blue 2
5 yellow 2
You can count the distinct values of col1 and col2 together:
df %>%
group_by(ID, col1) %>%
summarise(count = n_distinct(c(col1, col2)))
ID col1 count
<int> <chr> <int>
1 1 green 3
2 2 yellow 2
3 3 yellow 2
4 4 blue 2
5 5 yellow 2
Or, to count the distinct values of every remaining column on its own (this ignores col1, so ID 4 gives 1):
df %>%
group_by(ID, col1) %>%
summarise_all(n_distinct)
ID col1 col2
<int> <chr> <int>
1 1 green 3
2 2 yellow 2
3 3 yellow 2
4 4 blue 1
5 5 yellow 2
To group by every three rows:
df %>%
group_by(group = gl(n()/3, 3), col1) %>%
summarise(count = n_distinct(col2))
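For reference, gl(n, k) generates a factor with n levels, each repeated k times, which is what produces one group per block of three rows. The sketch below shows this for 15 rows, plus a hypothetical alternative, `block_any`, that would also work when the row count is not a multiple of 3.

```r
n_rows <- 15

# Five levels, each repeated three times: 1 1 1 2 2 2 ... 5 5 5
block_gl <- gl(n_rows / 3, 3)
print(block_gl)

# Assumption: a block index built from row positions, which does not
# require the number of rows to be a multiple of 3
block_any <- ceiling(seq_len(n_rows) / 3)
print(block_any)
```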
I have the following df:
name color
A red
B red
C green
D red
E red
F red
And I want to test the values in the 'color' column to see if they're the same as the values in the row above and write to a new column... I can do so using the following:
> df$same <- ifelse(df$color == df$color[c(NA,1:(nrow(df)-1))], 1, 0)
To give me:
name color same
A red NA
B red 1
C green 0
D red 0
E red 1
F red 1
But is there a cleaner way to do it? (I use this all the time)...
Adding to Rafael's answer, you can use ifelse with dplyr::mutate:
> dt <- tibble(name = c('A', 'B', 'C', 'D', 'E', 'F'), color = c('red', 'red', 'green', 'red', 'red', 'red'))
> dt
# A tibble: 6 x 2
name color
<chr> <chr>
1 A red
2 B red
3 C green
4 D red
5 E red
6 F red
> dt %>% mutate(same = ifelse(color == lag(color), 1, 0))
# A tibble: 6 x 3
name color same
<chr> <chr> <dbl>
1 A red NA
2 B red 1
3 C green 0
4 D red 0
5 E red 1
6 F red 1
You can try the lag function from the dplyr package. You can create a new column with the values of the row above and then compare the two columns:
> dt$color_above <- lag(dt$color, n=1)
> dt
name color color_above
1 A red <NA>
2 B red red
3 C green red
4 D red green
5 E red red
6 F red red
Or, to solve the issue directly, you can use the pipe operators from the magrittr package. It is still verbose, but I think it keeps the code clearer.
> dt %$%
{ color == lag(color, n=1) } %>%
as.numeric() %>%
{.} -> dt$same
> dt
name color same
1 A red NA
2 B red 1
3 C green 0
4 D red 0
5 E red 1
6 F red 1
How about this in standard R (maybe a bit more readable, but not much shorter than yours):
colour <- c("red","red","green","red","red","red")
(c(NA, colour) == c(colour, NA))[1:length(colour)]
[1] NA TRUE FALSE FALSE TRUE TRUE
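Another base R sketch of the same comparison builds the "row above" vector explicitly and converts the result to 1/0, so it matches the column in the question. The names `previous` and `same` are illustrative.

```r
colour <- c("red", "red", "green", "red", "red", "red")

# Shift the vector down by one: the first element has no predecessor
previous <- c(NA, head(colour, -1))

# TRUE/FALSE/NA becomes 1/0/NA
same <- as.integer(colour == previous)
print(same)
```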
I have a data frame such as data
data = data.frame(ID = as.factor(c("A", "A", "B","B","C","C")),
var.color= as.factor(c("red", "blue", "green", "red", "green", "yellow")))
I wonder whether it is possible to get the levels of each group in ID (e.g. A, B, C) and create a variable that pastes them. I have attempted to do so by running the following:
data %>% group_by(ID) %>%
mutate(ex = paste(droplevels(var.color), sep = "_"))
That yields:
Source: local data frame [6 x 3]
Groups: ID [3]
ID var.color ex
<fctr> <fctr> <chr>
1 A red red
2 A blue blue
3 B green red
4 B red red
5 C green green
6 C yellow yellow
However, my desired data.frame should be something like:
ID var.color ex
<fctr> <fctr> <chr>
1 A red red_blue
2 A blue red_blue
3 B green green_red
4 B red green_red
5 C green green_yellow
6 C yellow green_yellow
Basically, you need collapse instead of sep.
Instead of dropping levels, you can just paste the text together, grouped by ID:
library(dplyr)
data %>% group_by(ID) %>%
mutate(ex = paste(var.color, collapse = "_"))
# ID var.color ex
# <fctr> <fctr> <chr>
#1 A red red_blue
#2 A blue red_blue
#3 B green green_red
#4 B red green_red
#5 C green green_yellow
#6 C yellow green_yellow
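The difference between sep and collapse is easiest to see on a single vector: sep separates the arguments passed to paste, while collapse joins the elements of the result into one string. The sketch below also shows a base R grouping with ave(), offered as one possible alternative; `with_sep`, `with_collapse` and the anonymous function are illustrative.

```r
colours <- c("red", "blue")

with_sep      <- paste(colours, sep = "_")       # only one argument, so nothing to separate
with_collapse <- paste(colours, collapse = "_")  # elements joined into a single string
print(with_sep)
print(with_collapse)

data <- data.frame(ID = as.factor(c("A", "A", "B", "B", "C", "C")),
                   var.color = as.factor(c("red", "blue", "green",
                                           "red", "green", "yellow")))

# Apply the collapse within each ID; ave() recycles the result over the group
data$ex <- ave(as.character(data$var.color), data$ID,
               FUN = function(x) paste(x, collapse = "_"))
print(data)
```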
You can do the same with a loop:
for(i in unique(data$ID)){
data$ex[data$ID==i] <- paste0(data$var.color[data$ID==i], collapse = "_")
}
> data
ID var.color ex
1 A red red_blue
2 A blue red_blue
3 B green green_red
4 B red green_red
5 C green green_yellow
6 C yellow green_yellow