This question already has answers here:
Counting unique / distinct values by group in a data frame
(12 answers)
Closed 6 years ago.
Let's say that I have a simple data frame in R, as follows:
#example data frame
a = c("red","red","green")
b = c("01/01/1900","01/02/1950","01/05/1990")
df = data.frame(a,b)
colnames(df)<-c("Color","Dates")
My goal is to count the number of dates (in aggregate, not individually) for each value in the "Color" column. So the result would look like this:
#output should look like this:
a = c("red","green")
b = c(2,1)
df = data.frame(a,b)
colnames(df)<-c("Color","Dates")
Red was associated with two dates; the dates themselves are unimportant, I'd just like to count the aggregate number of dates per color in the data frame.
In base R:
sapply(split(df, df$Color), nrow)
# green red
# 1 2
We can use data.table
library(data.table)
setDT(df)[, .(Dates = uniqueN(Dates)) , Color]
# Color Dates
#1: red 2
#2: green 1
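Note that `uniqueN(Dates)` counts distinct dates per color; if the goal is simply the number of rows per color (as in this example, where every date is unique within its color), `.N` does that directly. A minimal sketch using the question's data:

```r
library(data.table)

df <- data.table(Color = c("red", "red", "green"),
                 Dates = c("01/01/1900", "01/02/1950", "01/05/1990"))

# .N counts rows per group; uniqueN(Dates) counts distinct dates.
# They agree here because no date repeats within a color.
df[, .(Dates = .N), by = Color]
#    Color Dates
# 1:   red     2
# 2: green     1
```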
using the dplyr package from the tidyverse:
library(dplyr)
df %>% group_by(Color) %>% summarise(n())
# # A tibble: 2 × 2
# Color `n()`
# <fctr> <int>
# 1 green 1
# 2 red 2
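As a shorthand, dplyr's `count()` collapses the `group_by()` + `summarise(n())` pair into a single call and names the result column `n`, avoiding the backticked `n()` column above. A quick sketch with the question's data:

```r
library(dplyr)

df <- data.frame(Color = c("red", "red", "green"),
                 Dates = c("01/01/1900", "01/02/1950", "01/05/1990"))

# count() groups by Color and adds a column n with the row count
count(df, Color)
#   Color n
# 1 green 1
# 2   red 2
```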
This question already has answers here:
How can I calculate the sum of comma separated in the 2nd column
(3 answers)
Closed 2 years ago.
I have a data frame:
df <- data.frame(sample_names=c("foo","bar","foo, bar"), sample_values=c(1,5,3))
df
sample_names sample_values
1 foo 1
2 bar 5
3 foo, bar 3
and I want a resulting data.frame of the following shape:
sample_names sample_values
1 foo 4
2 bar 8
Is there an elegant way to achieve this? My workaround would be to grep for "," and somehow fiddle the result into the existing rows. Since I want to apply this to multiple data frames, I'd prefer a simpler solution. Any ideas?
We can use separate_rows to split the column, then do a group-by operation to get the sum:
library(dplyr)
library(tidyr)
df %>%
separate_rows(sample_names) %>%
group_by(sample_names) %>%
summarise(sample_values = sum(sample_values), .groups = 'drop')
Output:
# A tibble: 2 x 2
# sample_names sample_values
# <chr> <dbl>
#1 bar 8
#2 foo 4
Or with base R, by splitting the column with strsplit into a list of vectors and then using tapply to do a group-by sum:
lst1 <- strsplit(df$sample_names, ",\\s+")
tapply(rep(df$sample_values, lengths(lst1)), unlist(lst1), FUN = sum)
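The `tapply()` call returns a named vector rather than a data frame; wrapping it back into the requested shape is one extra step. A sketch of the same idea, assuming the `df` defined in the question:

```r
df <- data.frame(sample_names = c("foo", "bar", "foo, bar"),
                 sample_values = c(1, 5, 3))

# Split the comma-separated names, repeat each value once per name,
# then sum by name and rebuild a data frame from the named vector.
lst1 <- strsplit(df$sample_names, ",\\s+")
sums <- tapply(rep(df$sample_values, lengths(lst1)), unlist(lst1), FUN = sum)
out <- data.frame(sample_names = names(sums), sample_values = as.vector(sums))
out
#   sample_names sample_values
# 1          bar             8
# 2          foo             4
```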
This question already has answers here:
Collapse / concatenate / aggregate a column to a single comma separated string within each group
(6 answers)
Closed 3 years ago.
I have a data frame as follows:
df <- data.frame(Name = c("a","c","d","b","f","g","h"), group = c(2,1,2,3,1,3,1))
Name group
a 2
c 1
d 2
b 3
f 1
g 3
h 1
I would like to use the gather function from the tidyverse package to reshape my data frame into the following format:
group Name total
1 c,f,h 3
2 a,d 2
3 b,g 2
Do you know how can I do this?
Thanks,
We can group by 'group' and paste the elements of 'Name' together with toString, while getting the total number of elements with n():
library(dplyr)
df %>%
group_by(group) %>%
summarise(Name = toString(Name), total = n())
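The same collapse works in base R with `aggregate()`, building the pasted names and the counts separately and merging them. A sketch, assuming the `df` from the question:

```r
df <- data.frame(Name = c("a", "c", "d", "b", "f", "g", "h"),
                 group = c(2, 1, 2, 3, 1, 3, 1))

# Collapse Name per group, then attach a per-group row count
names_df <- aggregate(Name ~ group, df, toString)
total_df <- aggregate(Name ~ group, df, length)
out <- merge(names_df, total_df, by = "group")
colnames(out) <- c("group", "Name", "total")
out
#   group    Name total
# 1     1 c, f, h     3
# 2     2    a, d     2
# 3     3    b, g     2
```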
This question already has answers here:
"Adding missing grouping variables" message in dplyr in R
(4 answers)
Closed 4 years ago.
So, I have a large data.frame with multiple columns, of which "trial.number" and "indexer" are two.
It annoys me that dplyr constantly, no matter what, adds the indexer column.
A simple example:
saccade.df %>%
distinct(trial.number, .keep_all = F)
I would expect to see the unique trial.numbers and only the trial.number column. However, the output also contains the indexer column.
How do I stop dplyr from doing this? And why isn't it showing the unique trial.numbers, but only the unique indexer values (which I didn't even ask for)?
example.df <- data.frame(trial.number = rep(1:10, each = 10), time =
seq(1:100), indexer = rep(21:30, each = 10))
example.df %>%
distinct(trial.number, .keep_all = F)
This does give the right output. However, I had somehow grouped my own variables.
Thanks!
Try ungroup():
df <- data.frame(trial.number=1:2,indexer=3:4)
df %>% distinct(trial.number)
# trial.number
#1 1
#2 2
df %>% group_by(trial.number,indexer) %>% distinct(trial.number)
## A tibble: 2 x 2
## Groups: trial.number, indexer [2]
# trial.number indexer
# <int> <int>
#1 1 3
#2 2 4
df %>% group_by(trial.number,indexer) %>% ungroup %>% distinct(trial.number)
## A tibble: 2 x 1
# trial.number
# <int>
#1 1
#2 2
df1 <-
data.frame(Sector=c("auto","auto","auto","industry","industry","industry"),
Topic=c("1","2","3","3","5","5"),
Frequency=c(1,2,5,2,3,2))
df1
df2 <-
data.frame(Sector=c("auto","auto","auto"),
Topic=c("1","2","3"),
Frequency=c(1,2,5))
df2
I have data frame 1 (df1) above and want a conditional subset of it that looks like df2. The condition is as follows:
"If at least one observation of a sector has a frequency larger than 3, keep all observations of that sector; otherwise, drop all observations of that sector."
In the example above, only the three observations of the auto sector remain; industry is dropped.
Does anybody have an idea of a condition that achieves this subset?
We can use group_by and filter from dplyr to achieve this.
library(dplyr)
df2 <- df1 %>%
group_by(Sector) %>%
filter(any(Frequency > 3)) %>%
ungroup()
df2
# # A tibble: 3 x 3
# Sector Topic Frequency
# <fct> <fct> <dbl>
# 1 auto 1 1.
# 2 auto 2 2.
# 3 auto 3 5.
Here is a solution with base R:
df1 <-
data.frame(Sector=c("auto","auto","auto","industry","industry","industry"),
Topic=c("1","2","3","3","5","5"),
Frequency=c(1,2,5,2,3,2))
subset(df1, ave(Frequency, Sector, FUN=max) >3)
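The trick here is that `ave()` broadcasts the per-group maximum back onto every row, so the comparison yields one logical per row. A sketch of the intermediate values, using the `df1` above:

```r
df1 <- data.frame(Sector = c("auto", "auto", "auto",
                             "industry", "industry", "industry"),
                  Topic = c("1", "2", "3", "3", "5", "5"),
                  Frequency = c(1, 2, 5, 2, 3, 2))

# ave() returns the group max repeated for each row of that group
ave(df1$Frequency, df1$Sector, FUN = max)
# [1] 5 5 5 3 3 3

# Rows whose group max exceeds 3 are kept
subset(df1, ave(Frequency, Sector, FUN = max) > 3)
#   Sector Topic Frequency
# 1   auto     1         1
# 2   auto     2         2
# 3   auto     3         5
```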
and a solution with data.table:
library("data.table")
setDT(df1)[, if (max(Frequency)>3) .SD, by=Sector]
This question already has answers here:
count number of rows in a data frame in R based on group [duplicate]
(8 answers)
Closed 6 years ago.
Say I have a data table like this:
id days age
"jdkl" 8 23
"aowl" 1 09
"mnoap" 4 82
"jdkl" 3 14
"jdkl" 2 34
"mnoap" 27 56
I want to create a new data table that has one column with the ids and one column with the number of times they appear. I know that data.table has something with .N, but I wasn't sure how to use it for only one column.
The final data table would look like this:
id count
"jdkl" 3
"mnoap" 2
"aowl" 1
You can just use table from base R:
as.data.frame(sort(table(df$id), decreasing = TRUE))
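Note that `as.data.frame()` on a table names the columns `Var1` and `Freq`; renaming them gives the requested shape. A sketch, assuming the question's ids in a data frame `df`:

```r
df <- data.frame(id = c("jdkl", "aowl", "mnoap", "jdkl", "jdkl", "mnoap"))

# table() counts occurrences; sort() orders them; rename the
# default Var1/Freq columns to the requested id/count
out <- as.data.frame(sort(table(df$id), decreasing = TRUE))
colnames(out) <- c("id", "count")
out
#      id count
# 1  jdkl     3
# 2 mnoap     2
# 3  aowl     1
```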
However, if you want to do it using data.table:
library(data.table)
setDT(df)[, .(Count = .N), by = id][order(-Count)]
Or there is the dplyr solution:
library(dplyr)
df %>% count(id) %>% arrange(desc(n))
We can use
library(dplyr)
df %>%
group_by(id) %>%
summarise(Count = n()) %>%
arrange(desc(Count))
Or using aggregate from base R
r1 <- aggregate(cbind(Count = days) ~ id, df, length)
r1[order(-r1$Count),]
# id Count
#2 jdkl 3
#3 mnoap 2
#1 aowl 1