This question already has answers here:
Counting unique / distinct values by group in a data frame
(12 answers)
Closed 1 year ago.
I'd like to create a new data table from my old one that includes a count of all the "article_id" that occur for each date (i.e. there are three article_id's listed for the date 2001-10-01, so I'd like one column with the date and one column that has the article count, "3").
Here is the output of the data table:
date article_id N
1: 2001-09-01 FAS_200109_11104 3
2: 2001-10-01 FAS_200110_11126 6
3: 2001-10-01 FAS_200110_11157 21
4: 2001-10-01 FAS_200110_11160 5
5: 2001-11-01 FAS_200111_11220 26
---
7359: 2019-08-01 FAZ_201908_2958 7
7360: 2019-09-01 FAZ_201909_3316 8
7361: 2019-09-01 FAZ_201909_3515 13
7362: 2000-12-01 FAZ_200012_92981 3
7363: 2001-08-01 FAZ_200108_86041 14
So I'll have to move over the unique date values to a new data frame (so that each date is only shown once), as well as a count of article_id's shown for each date.
I've been trying to figure this out but haven't found exactly the right answer regarding how to count the occurrence of a character vector (the article_id) by group (date). I think this is something pretty simple in R, but I'm new to the program and don't have much support so I would very much appreciate your suggestions - thank you so much!
The expected output is not clear, so here are a few options, each based on a different assumption about what is wanted.
Sum of 'N' by 'date'
library(data.table)
dt[, .(N = sum(N, na.rm = TRUE)), by = date]
Count of unique 'article_id' for each date
dt[, .(N = uniqueN(article_id)), by = date]
Get the first count by 'date'
dt[, .(N = first(N)), by = date]
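To see how the three readings differ, here is a minimal base R sketch on the five rows visible at the top of the question's table. The name `dt_sample` is made up for illustration, and `tapply()` stands in for the data.table grouping; it is not part of the original answer.

```r
# First five rows shown in the question (dt_sample is a hypothetical name)
dt_sample <- data.frame(
  date = c("2001-09-01", "2001-10-01", "2001-10-01", "2001-10-01", "2001-11-01"),
  article_id = c("FAS_200109_11104", "FAS_200110_11126", "FAS_200110_11157",
                 "FAS_200110_11160", "FAS_200111_11220"),
  N = c(3, 6, 21, 5, 26)
)

# 1. Sum of N by date
sum_by_date <- tapply(dt_sample$N, dt_sample$date, sum, na.rm = TRUE)

# 2. Number of distinct article_id values by date
ids_by_date <- tapply(dt_sample$article_id, dt_sample$date,
                      function(x) length(unique(x)))

# 3. First N by date
first_by_date <- tapply(dt_sample$N, dt_sample$date, function(x) x[1])

# Turn the second result into a two-column table, as the question asks
article_counts <- data.frame(date = names(ids_by_date),
                             article_count = as.vector(ids_by_date))
print(article_counts)
```

On these rows, 2001-10-01 gets 32 under the first reading, 3 under the second, and 6 under the third; the "3" mentioned in the question matches the second reading.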
We could group and then summarise:
library(dplyr)
df %>%
group_by(date) %>%
summarise(n = n())
date n
<chr> <int>
1 2000-12-01 1
2 2001-08-01 1
3 2001-09-01 1
4 2001-10-01 3
5 2001-11-01 1
6 2019-08-01 1
7 2019-09-01 2
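Note that `n()` counts rows, which equals the number of distinct articles only if each article_id appears once per date. A small base R sketch of the difference, using made-up rows in which one article_id is repeated (the data frame `articles` is hypothetical):

```r
# Hypothetical rows: one article_id appears twice on the same date
articles <- data.frame(
  date = c("2001-10-01", "2001-10-01", "2001-10-01", "2001-11-01"),
  article_id = c("FAS_200110_11126", "FAS_200110_11126",
                 "FAS_200110_11157", "FAS_200111_11220")
)

# Rows per date (what n() counts)
rows_per_date <- table(articles$date)

# Distinct article_id values per date
distinct_per_date <- tapply(articles$article_id, articles$date,
                            function(x) length(unique(x)))

print(rows_per_date)
print(distinct_per_date)
```

If the data can contain such repeats, counting distinct values per group is likely the safer choice.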
Here are two tidyverse solutions:
Libraries
library(tidyverse)
library(lubridate) # for ymd(); not attached by library(tidyverse) before tidyverse 2.0.0
Example Data
df <-
tibble(
date = ymd(c("2001-09-01","2001-10-01","2001-10-01")),
article_id = c("FAS_200109_11104","FAS_200110_11126","FAS_200110_11157"),
N = c(3,6,21)
)
Solution
Solution 1
df %>%
group_by(date) %>%
summarise(N = sum(N,na.rm = TRUE))
Solution 2
df %>%
count(date,wt = N)
Result
# A tibble: 2 x 2
date n
<date> <dbl>
1 2001-09-01 3
2 2001-10-01 27
Using the below code in dplyr 0.7.6, I try to calculate the rank of a variable for each day on a dataset. But dplyr doesn't account for the group_by(CREATIONDATE_DAY)
library(dplyr)
dates <- sample(seq(from=as.POSIXct("2019-03-12",tz="UTC"),to=as.POSIXct("2019-03-20",tz="UTC"),by = "day"),size = 100,replace=TRUE)
group <- sample(c("A","B","C"),100,TRUE)
df <- data.frame(CREATIONDATE_DAY = dates,GROUP = group)
# calculate the occurrences for each day and group
dfMod <- df %>% group_by(CREATIONDATE_DAY,GROUP) %>%
dplyr::summarise(COUNT = n()) %>% ungroup()
# Compute the rank by count for each day
dfMod <- dfMod %>% group_by(CREATIONDATE_DAY) %>%
mutate(rank = rank(-COUNT, ties.method ="min"))
But the rank values are calculated over the entire data set instead of within each creation day. As seen in the image, the row with id 24 should be rank 1, since 4 is the highest value for 16.03.2019, and row 23 should be rank 2 for that day. Where is my mistake?
Edit: added desired output:
Edit #2: as MrFlick pointed out, I checked my dplyr version (0.7.6), and upgrading to the most current version fixed the issue for me.
It seems there may be a conflict with another package. If you have lubridate loaded, try reversing the order in which you load lubridate and dplyr (I tried your example and it gave me the right answer). You can also try:
dfMod <- dfMod %>% group_by(CREATIONDATE_DAY) %>% mutate(rank = row_number(desc(COUNT)))
> head(dfMod)
# A tibble: 6 x 4
# Groups: CREATIONDATE_DAY [2]
CREATIONDATE_DAY GROUP COUNT rank
<dttm> <fct> <int> <int>
1 2019-03-12 00:00:00 A 2 3
2 2019-03-12 00:00:00 B 5 1
3 2019-03-12 00:00:00 C 4 2
4 2019-03-13 00:00:00 A 4 1
5 2019-03-13 00:00:00 B 3 2
6 2019-03-13 00:00:00 C 2 3
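One thing to keep in mind is that `rank(-COUNT, ties.method = "min")` and `row_number(desc(COUNT))` treat ties differently: the first gives tied counts the same rank, the second breaks ties by row order. A minimal base R sketch of ranking within each day, using `ave()` in place of `group_by()` and a small made-up table (`counts` and its values are hypothetical):

```r
# Hypothetical per-day counts; d1 contains a tie between groups A and B
counts <- data.frame(
  day   = c("d1", "d1", "d1", "d2", "d2"),
  group = c("A", "B", "C", "A", "B"),
  COUNT = c(4, 4, 2, 1, 5)
)

# Rank within each day, ties share the lowest rank
counts$rank_min <- ave(-counts$COUNT, counts$day,
                       FUN = function(x) rank(x, ties.method = "min"))

# Rank within each day, ties broken by row order
# (the same tie handling that dplyr documents for row_number())
counts$rank_first <- ave(-counts$COUNT, counts$day,
                         FUN = function(x) rank(x, ties.method = "first"))

print(counts)
```

Which tie handling is appropriate depends on whether two groups with the same count should share a rank.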
I have this data basically, but larger:
I want to count a number of distinct combinations of (customer_id, account_id) - that is, distinct or unique values based on two columns, but for each start_date. I can't find the solution anywhere. The result should be another column added to my data.table that should look like this:
That is, for each start_date, it calculates number of distinct values based on both customer_id and account_id.
For example, for start_date equal to 2.2.2018, I have distinct combinations in (customer_id,account_id) being (4,22) (5,38) and (6,13), so I want count to be equal to 3 because I have 3 distinct combinations. I also need the solution to work with character values in customer_id and account_id columns.
Code to replicate the data:
customer_id <- c(1,1,1,2,3,3,4,5,5,6)
account_id <- c(11,11,11,11,55,88,22,38,38,13)
start_date <- c(rep(as.Date("2017-01-01","%Y-%m-%d"),each=6),rep(as.Date("2018-02-02","%Y-%m-%d"),each=4))
data <- data.table(customer_id,account_id,start_date)
Another dplyr option:
library(dplyr)
customer_id <- c(1,1,1,2,3,3,4,5,5,6)
account_id <- c(11,11,11,11,55,88,22,38,38,13)
start_date <- c(rep(as.Date("2017-01-01","%Y-%m-%d"),each=6),rep(as.Date("2018-02-02","%Y-%m-%d"),each=4))
data <- data.frame(customer_id,account_id,start_date)
data %>%
group_by(start_date)%>%
mutate(distinct_values = n_distinct(customer_id, account_id)) %>%
ungroup()
dplyr option
customer_id <- c(1,1,1,2,3,3,4,5,5,6)
account_id <- c(11,11,11,11,55,88,22,38,38,13)
start_date <- c(rep(as.Date("2017-01-01","%Y-%m-%d"),each=6),rep(as.Date("2018-02-02","%Y-%m-%d"),each=4))
data <- data.frame(customer_id,account_id,start_date)
data %>%
group_by(start_date, customer_id, account_id) %>%
summarise(Total = 1) %>%
group_by(start_date) %>%
summarise(Count =n())
Here is a data.table option
data[, N := uniqueN(paste(customer_id, account_id, sep = "_")), by = start_date]
# customer_id account_id start_date N
# 1: 1 11 2017-01-01 4
# 2: 1 11 2017-01-01 4
# 3: 1 11 2017-01-01 4
# 4: 2 11 2017-01-01 4
# 5: 3 55 2017-01-01 4
# 6: 3 88 2017-01-01 4
# 7: 4 22 2018-02-02 3
# 8: 5 38 2018-02-02 3
# 9: 5 38 2018-02-02 3
#10: 6 13 2018-02-02 3
Or
data[, N := uniqueN(.SD, by = c("customer_id", "account_id")), by = start_date]
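When two key columns are pasted into a single string, the separator has to sit between the values; otherwise pairs such as (1, 11) and (11, 1) produce the same key and are counted as one. A minimal base R sketch of that pitfall and of the per-date distinct count on the question's data (the names `key` and `n_distinct_pairs` are made up for illustration):

```r
# Separator at the end: both pairs collapse to the same string
paste0(1, 11, "_") == paste0(11, 1, "_")               # TRUE, a collision
# Separator between the values: the pairs stay distinct
paste(1, 11, sep = "_") == paste(11, 1, sep = "_")     # FALSE

# Data from the question
customer_id <- c(1, 1, 1, 2, 3, 3, 4, 5, 5, 6)
account_id  <- c(11, 11, 11, 11, 55, 88, 22, 38, 38, 13)
start_date  <- c(rep(as.Date("2017-01-01"), 6), rep(as.Date("2018-02-02"), 4))

# One key per (customer_id, account_id) pair; works for character columns too
key <- paste(customer_id, account_id, sep = "_")

# Number of distinct keys within each start_date, repeated on every row
n_distinct_pairs <- ave(seq_along(key), format(start_date),
                        FUN = function(i) length(unique(key[i])))
print(n_distinct_pairs)
```

If the ID columns can themselves contain the separator character, a pasted key can still collide, which is one reason the `uniqueN(.SD, by = ...)` form may be preferable.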
I have a dataset with three columns as below:
data <- data.frame(
grpA = c(1,1,1,1,1,2,2,2),
idB = c(1,1,2,2,3,4,5,6),
valueC = c(10,10,20,20,10,30,40,50),
otherD = c(1,2,3,4,5,6,7,8)
)
valueC is unique to each unique value of idB.
I want to use dplyr pipe (as the rest of my code is in dplyr) and use group_by on grpA to get a new column with sum of valueC values for each group.
The answer should be like:
newCol <- c(40,40,40,40,40,120,120,120)
but with data %>% group_by(grpA) %>% mutate(newCol = sum(valueC)), I get newCol <- c(70,70,70,70,70,120,120,120)
How do I take only the unique values of idB into account? Is there anything else I can use instead of group_by in a dplyr %>% pipe?
I can't use summarise, as I need to keep the values in otherD intact for later use.
The other option I have is to create newCol separately through SQL and then merge it in with a left join, but I am looking for a better inline solution.
If this has been answered before, please point me to the link, as I could not find a relevant answer.
We need unique with match
data %>%
group_by(grpA) %>%
mutate(ind = sum(valueC[match(unique(idB), idB)]))
# A tibble: 8 x 5
# Groups: grpA [2]
# grpA idB valueC otherD ind
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 1 10 1 40
#2 1 1 10 2 40
#3 1 2 20 3 40
#4 1 2 20 4 40
#5 1 3 10 5 40
#6 2 4 30 6 120
#7 2 5 40 7 120
#8 2 6 50 8 120
Or another option is to get the distinct rows by 'grpA', 'idB', grouped by 'grpA', get the sum of 'valueC' and left_join with the original data
data %>%
distinct(grpA, idB, .keep_all = TRUE) %>%
group_by(grpA) %>%
summarise(newCol = sum(valueC)) %>%
left_join(data, ., by = 'grpA')
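The expression `match(unique(idB), idB)` returns the position of the first row for each idB, which is the same set of rows that `!duplicated(idB)` selects. A base R sketch of the same idea, assuming (as the question states) that valueC is constant within each idB; the name `first_of_id` is made up for illustration:

```r
data <- data.frame(
  grpA   = c(1, 1, 1, 1, 1, 2, 2, 2),
  idB    = c(1, 1, 2, 2, 3, 4, 5, 6),
  valueC = c(10, 10, 20, 20, 10, 30, 40, 50),
  otherD = c(1, 2, 3, 4, 5, 6, 7, 8)
)

# TRUE for the first row of every (grpA, idB) pair
first_of_id <- !duplicated(data[c("grpA", "idB")])

# Sum valueC over those first rows only, per grpA
group_sum <- tapply(data$valueC[first_of_id], data$grpA[first_of_id], sum)

# Map the group sums back onto every row, keeping otherD intact
data$newCol <- as.vector(group_sum[as.character(data$grpA)])
print(data)
```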
This question already has answers here:
How can I rank observations in-group faster?
(4 answers)
Closed 5 years ago.
I have a dataframe df with a column called ID.
Multiple rows may have the same ID and I want to set a column value "occurrence" to indicate how many times the ID has been seen before.
for (i in unique(df$ID)) {
rows = df[df$ID==i, ]
for (idx in 1:nrow(rows)) {
rows[idx,'occurrence'] = idx
}
}
Unfortunately, this adds the occurrence column to rows, but it does not update the original data frame. How do I get the occurrence column added to df?
Update: The row_number() function pointed out by neilfws works great. Actually, I have a follow-up question: the dataframe also has a year column, and what I need to do is add a new column (say Prev.Year.For.This.ID) for the year of the previous occurrence of the ID, e.g., if the input is
Year = c(1991,1992,1993,1994,1995)
ID = c(1,2,1,2,1)
df <- data.frame (Year, ID)
I'd like the output to look like this:
ID Year occurrence Prev.Year.For.This.Id
1 1991 1 <NA>
2 1992 1 <NA>
1 1993 2 1991
2 1994 2 1992
1 1995 3 1993
You can use dplyr to group_by ID; row_number() then gives a running count of occurrences within each ID.
library(dplyr)
df1 <- data.frame(ID = c(1,2,3,1,4,5,6,2,7,8,2))
df1 %>%
group_by(ID) %>%
mutate(cnt = row_number()) %>%
ungroup()
ID cnt
<dbl> <int>
1 1 1
2 2 1
3 3 1
4 1 2
5 4 1
6 5 1
7 6 1
8 2 2
9 7 1
10 8 1
11 2 3
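As for why the loop in the question has no effect: `rows` is a copy of the subset, so assigning into it never touches `df`. A minimal base R sketch of two ways around that, using the same sample IDs as above (the data frame name `ids` is made up for illustration):

```r
ids <- data.frame(ID = c(1, 2, 3, 1, 4, 5, 6, 2, 7, 8, 2))

# Option 1: keep the loop, but assign into the original data frame
ids$occurrence <- NA_integer_
for (i in unique(ids$ID)) {
  idx <- which(ids$ID == i)             # row positions of this ID in ids
  ids$occurrence[idx] <- seq_along(idx) # 1, 2, 3, ... in order of appearance
}

# Option 2: no loop; ave() applies seq_along() within each ID
ids$occurrence_ave <- ave(ids$ID, ids$ID, FUN = seq_along)

print(ids)
```

Both columns should match the `cnt` column produced by the dplyr version.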
Are you after something like the following (I made up sample data for you):
library(dplyr)
df = data.frame(ID = c(1,1,1,2,2,3))
answer = df %>% group_by(ID) %>% mutate(occurrence = row_number() - 1) %>% as.data.frame() # row_number() also works when an ID is 0 or non-numeric
This will give something which looks like this:
ID occurrence
1 0
1 1
1 2
2 0
2 1
3 0
The dplyr package is a great tool for grouping and summarising data. I also find the code very readable when I use the pipe %>% (though, admittedly, it does take some getting used to).
> library(data.table)
> df = data.frame(ID = c(1,1,1,2,2,3))
> df <- data.table(df)
> df[, occurrence := sequence(.N), by = c("ID")]
> df
ID occurrence
1: 1 1
2: 1 2
3: 1 3
4: 2 1
5: 2 2
6: 3 1
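For the follow-up in the question (the year of the previous occurrence of each ID), one possible base R sketch shifts the years by one position within each ID. The Year values are taken from the desired-output table in the question; `visits` and `prev_year` are hypothetical names, and the rows are assumed to be sorted by year already.

```r
visits <- data.frame(Year = c(1991, 1992, 1993, 1994, 1995),
                     ID   = c(1, 2, 1, 2, 1))

# Running count of occurrences within each ID
visits$occurrence <- ave(visits$ID, visits$ID, FUN = seq_along)

# Year of the previous row with the same ID; NA for the first occurrence
visits$prev_year <- ave(visits$Year, visits$ID,
                        FUN = function(y) c(NA, head(y, -1)))

print(visits)
```

In dplyr or data.table, the same shift would typically be done with a grouped lag rather than `ave()`.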
This question already has answers here:
Calculate group mean, sum, or other summary stats. and assign column to original data
(4 answers)
faster way to create variable that aggregates a column by id [duplicate]
(6 answers)
Closed 5 years ago.
I have one column for company, one for sales, and another for country. I need to sum the sales within each country separately, so that each company (name) row carries the total sales of its country in a new column. Sales in all countries are expressed in the same currency.
I have tried several ways of doing so, but none of them works:
df$total_country_sales = if(df$country[row] == df$country) { sum(df$sales)}
This sums all valuations, not only the ones that I need.
Name  Sales  Country  Total Country Sales (the new column I would like to have)
abc   122    US       5022
abc   100    Canada
aad   4900   US
I need to have the values in the same dataframe, but in a new column.
Since it is a large dataset, I cannot make a function to do so, but rather need to save it directly as a variable. (Or have I understood incorrectly that making functions is not the best way to solve such issues?)
I am new to R and programming in general, so I might be addressing the issue in an incorrect way.
Sorry for probably a stupid question.
Thanks!
If I understand your question correctly, this solves your problem:
df = data.frame(sales=c(1,3,2,4,5),region=c("A","A","B","B","B"))
library(dplyr)
totals = df %>% group_by(region) %>% summarize(total = sum(sales))
df = left_join(df, totals, by = "region")
It adds the group totals as a separate column, like this:
sales region total
1 1 A 4
2 3 A 4
3 2 B 11
4 4 B 11
5 5 B 11
Hope this helps.
We can use base R to do this
df$total_country_sales <- with(df, ave(sales, country, FUN = sum))
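The attempt in the question does not work because `if()` expects a single TRUE or FALSE rather than one value per row, and `sum(df$sales)` is not restricted to any country. As a worked example of the `ave()` approach, here is a sketch on the three rows shown in the question (the data frame name `sales_df` and the lower-case column names are assumptions):

```r
# The three rows shown in the question
sales_df <- data.frame(
  name    = c("abc", "abc", "aad"),
  sales   = c(122, 100, 4900),
  country = c("US", "Canada", "US")
)

# ave() computes sum(sales) within each country and repeats it on every row
sales_df$total_country_sales <- with(sales_df, ave(sales, country, FUN = sum))

print(sales_df)
```

Both US rows get 5022 (122 + 4900), matching the value given in the question, and the Canada row gets 100.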
It can be achieved using dplyr's mutate()
df = data.frame(sales=c(1,3,2,4,5),country=c("A","A","B","B","B"))
df
# sales country
# 1 1 A
# 2 3 A
# 3 2 B
# 4 4 B
# 5 5 B
df %>% group_by(country) %>% mutate(total_sales = sum(sales))
# Source: local data frame [5 x 3]
# Groups: country [2]
#
# # A tibble: 5 x 3
# sales country total_sales
# <dbl> <fctr> <dbl>
# 1 1 A 4
# 2 3 A 4
# 3 2 B 11
# 4 4 B 11
# 5 5 B 11
Using data.table:
library(data.table)
setDT(df)[, total_sales := sum(sales), by = country]
df
# sales country total_sales
# 1: 1 A 4
# 2: 3 A 4
# 3: 2 B 11
# 4: 4 B 11
# 5: 5 B 11