Count per year with only start and end year data - r

I want to create a line chart in ggplot2 with 350 beer breweries. I want to count per year how many active breweries there are. I only have the start and end year of each brewery's activity. tidyverse answers preferred.
begin_datum_jaar is the year the brewery started. eind_datum_jaar is the year the brewery ended.
example data frame:
library(tidyverse)
# A tibble: 4 x 3
brouwerijnaam begin_datum_jaar eind_datum_jaar
<chr> <int> <int>
1 Brand 1340 2019
2 Heineken 1592 2019
3 Grolsche 1615 2019
4 Bavaria 1719 2010
dput:
df <- structure(list(brouwerijnaam = c("Brand", "Heineken", "Grolsche",
"Bavaria"), begin_datum_jaar = c(1340L, 1592L, 1615L, 1719L),
eind_datum_jaar = c(2019L, 2019L, 2019L, 2010L)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -4L))
Desired output where etc. is a placeholder.
# A tibble: 13 x 2
year n
<chr> <dbl>
1 1340 1
2 1341 1
3 1342 1
4 1343 1
5 etc. 1
6 1592 2
7 1593 2
8 etc. 2
9 1615 3
10 1616 3
11 1617 3
12 1618 3
13 etc. 3

Could try:
library(tidyverse)
df %>%
rowwise %>%
do(data.frame(brouwerij = .$brouwerijnaam,
Year = seq(.$begin_datum_jaar, .$eind_datum_jaar, by = 1))) %>%
count(Year, name = "Active breweries") %>%
ggplot(aes(x = Year, y = `Active breweries`)) +
geom_line() +
theme_minimal()
Or try expand for the first part:
df %>%
group_by(brouwerijnaam) %>%
expand(Year = begin_datum_jaar:eind_datum_jaar) %>%
ungroup() %>%
count(Year, name = "Active breweries")
However, note that the rowwise/do and expand steps are resource-intensive and may take a long time on larger data. If that happens, I'd rather use data.table to expand the data frame and then continue as below:
library(data.table)
library(tidyverse)
df <- setDT(df)[, .(Year = seq(begin_datum_jaar, eind_datum_jaar, by = 1)), by = brouwerijnaam]
df %>%
count(Year, name = "Active breweries") %>%
ggplot(aes(x = Year, y = `Active breweries`)) +
geom_line() +
theme_minimal()
The above gives you the plot directly. If you'd like to save the counts to a data frame first (and then plot with ggplot2), this is the main part (I use data.table for the expansion as it's much faster in my experience):
library(data.table)
library(tidyverse)
df <- setDT(df)[
, .(Year = seq(begin_datum_jaar, eind_datum_jaar, by = 1)),
by = brouwerijnaam] %>%
count(Year, name = "Active breweries")
Output:
# A tibble: 680 x 2
Year `Active breweries`
<dbl> <int>
1 1340 1
2 1341 1
3 1342 1
4 1343 1
5 1344 1
6 1345 1
7 1346 1
8 1347 1
9 1348 1
10 1349 1
# ... with 670 more rows
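If expanding every brewery into one row per year is too slow, the counts can also be obtained without any expansion. The following is a minimal base R sketch, not part of the answer above. It assumes every brewery is active from its start year through its end year inclusive; it adds 1 in each start year, subtracts 1 in the year after each end year, and takes a running sum. The names years, starts, ends and active are illustrative.

```r
# Example data from the question
df <- data.frame(
  brouwerijnaam    = c("Brand", "Heineken", "Grolsche", "Bavaria"),
  begin_datum_jaar = c(1340L, 1592L, 1615L, 1719L),
  eind_datum_jaar  = c(2019L, 2019L, 2019L, 2010L)
)

# Every year covered by at least one brewery
years <- seq(min(df$begin_datum_jaar), max(df$eind_datum_jaar))

# Number of breweries opening in each year
starts <- tabulate(df$begin_datum_jaar - years[1] + 1L, nbins = length(years))

# Number of breweries that stop counting in each year (the year AFTER the end year);
# positions beyond the last year fall outside the bins and are ignored by tabulate()
ends <- tabulate(df$eind_datum_jaar - years[1] + 2L, nbins = length(years))

# Running balance of openings minus closings = active breweries per year
active <- data.frame(Year = years, n = cumsum(starts - ends))
head(active)
```

The resulting active data frame has one row per year and can be passed to ggplot() in the same way as the counted data above.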

We can use map2 to build the sequence from start year to end year for each brewery, unnest the list column to expand it, and use count to get the number of breweries per year:
library(tidyverse)
df %>%
transmute(year = map2(begin_datum_jaar, eind_datum_jaar, `:`)) %>%
unnest(year) %>%
count(year)
# A tibble: 680 x 2
# year n
# <int> <int>
# 1 1340 1
# 2 1341 1
# 3 1342 1
# 4 1343 1
# 5 1344 1
# 6 1345 1
# 7 1346 1
# 8 1347 1
# 9 1348 1
#10 1349 1
# … with 670 more rows
Or using Map from base R
table(unlist(do.call(Map, c(f = `:`, df[-1]))))

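library(dplyr)
df1 <- data.frame(year = 1000:2020) # Enter range for years of choice
df1 %>%
rowwise() %>%
# count the breweries whose active period includes this year (start and end year inclusive)
mutate(cnt = nrow(df %>%
filter(begin_datum_jaar <= year & eind_datum_jaar >= year)
)
)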

Related

Count number of times a value occurs and get the sum in R

I have seen that similar questions have been answered multiple times before. However, my case is different and I need your input.
I have data that looks like this
PipeID price
125 100
125 200
456 300
523 400
523 500
I got the number of occurrences by PipeID using the following code
num_occur <- df %>%
group_by(PipeID) %>%
summarise(freq = n())
This gives me the following results
PipeID freq
125 2
456 1
523 2
However I want to find the sum of the frequency. For example, the sum of the number of times a frequency of 2 occurs. The desired output is as follows:
freq count
2 4
1 1
Any timely help would be greatly appreciated!
You can simplify a bit with
df %>%
count(PipeID, name = "freq") %>%
group_by(freq) %>%
summarise(count = sum(freq))
#> # A tibble: 2 x 2
#> freq count
#> <int> <int>
#> 1 1 1
#> 2 2 4
Does this solve your problem?
library(tidyverse)
df <- read.table(text = "PipeID price
125 100
125 200
456 300
523 400
523 500",
header = TRUE)
df %>%
group_by(PipeID) %>%
summarise(freq = n()) %>%
group_by(freq) %>%
summarise(count = sum(freq))
#> # A tibble: 2 × 2
#> freq count
#> <int> <int>
#> 1 1 1
#> 2 2 4
Created on 2022-12-08 with reprex v2.0.2
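For comparison, the same two-step count can be sketched in base R without any packages. This is only an illustration, assuming the five-row example data from the question; the names freq, count and res are made up for the sketch.

```r
# Example data from the question
df <- data.frame(PipeID = c(125, 125, 456, 523, 523),
                 price  = c(100, 200, 300, 400, 500))

# Step 1: how often each PipeID occurs
freq <- as.vector(table(df$PipeID))

# Step 2: for each distinct frequency, sum that frequency over all PipeIDs having it
count <- tapply(freq, freq, sum)

res <- data.frame(freq = as.integer(names(count)), count = as.vector(count))
res
```

Here a frequency of 2 occurs for two PipeIDs, so its count is 2 + 2 = 4, matching the desired output.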

How to summarize `Number of days since first date` and `Number of days seen` by ID and for a large data frame

The dataframe df1 summarizes detections of individuals (ID) through the time (Date). As a short example:
library(lubridate) # for ymd()
df1<- data.frame(ID= c(1,2,1,2,1,2,1,2,1,2),
Date= ymd(c("2016-08-21","2016-08-24","2016-08-23","2016-08-29","2016-08-27","2016-09-02","2016-09-01","2016-09-09","2016-09-01","2016-09-10")))
df1
ID Date
1 1 2016-08-21
2 2 2016-08-24
3 1 2016-08-23
4 2 2016-08-29
5 1 2016-08-27
6 2 2016-09-02
7 1 2016-09-01
8 2 2016-09-09
9 1 2016-09-01
10 2 2016-09-10
I want to summarize both the number of days since the first detection of the individual (Ndays) and the number of distinct days on which the individual has been detected since the first time it was detected (Ndifdays).
Additionally, I would like to include in this summary table a variable called Prop that simply divides Ndifdays by Ndays.
The summary table that I would expect would be this:
> Result
ID Ndays Ndifdays Prop
1 1 11 4 0.364 # Between 21st Aug and 1st Sept there are 11 days.
2 2 17 5 0.294 # Between 24th Aug and 10th Sept there are 17 days.
Does anyone know how to do it?
You could achieve this using various summarising functions in dplyr
library(dplyr)
df1 %>%
group_by(ID) %>%
summarise(Ndays = as.integer(max(Date) - min(Date)),
Ndifdays = n_distinct(Date),
Prop = Ndifdays/Ndays)
# ID Ndays Ndifdays Prop
# <dbl> <int> <int> <dbl>
#1 1 11 4 0.364
#2 2 17 5 0.294
The data.table version of this would be
library(data.table)
df12 <- setDT(df1)[, .(Ndays = as.integer(max(Date) - min(Date)),
Ndifdays = uniqueN(Date)), by = ID]
df12$Prop <- df12$Ndifdays/df12$Ndays
and base R with aggregate
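df12 <- aggregate(Date~ID, df1, function(x) c(Ndays = as.integer(max(x) - min(x)), Ndifdays = length(unique(x))))
# aggregate() returns both values as a single matrix column named Date; unpack it into two columns
df12 <- cbind(df12["ID"], df12$Date)
df12$Prop <- df12$Ndifdays/df12$Ndays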
After grouping by 'ID', take the diff of the range of 'Date' to create 'Ndays', count the unique dates with n_distinct to get 'Ndifdays', and divide 'Ndifdays' by 'Ndays' to get 'Prop'
library(dplyr)
df1 %>%
group_by(ID) %>%
summarise(Ndays = as.integer(diff(range(Date))),
Ndifdays = n_distinct(Date),
Prop = Ndifdays/Ndays)
# A tibble: 2 x 4
# ID Ndays Ndifdays Prop
# <dbl> <int> <int> <dbl>
#1 1 11 4 0.364
#2 2 17 5 0.294

Performing in group operations in R

I have data with 2 fields in a table sf: Customer id and Buy_date. Each Buy_date is unique within a customer, but there can be more than 3 different Buy_date values for each customer. I want to calculate the difference between consecutive Buy_date values for each customer and its mean value. How can I do this?
Example
Customer Buy_date
1 2018/03/01
1 2018/03/19
1 2018/04/3
1 2018/05/10
2 2018/01/02
2 2018/02/10
2 2018/04/13
I want the results for each customer in the format
Customer mean
Here's a dplyr solution.
Your data:
df <- data.frame(Customer = c(1,1,1,1,2,2,2), Buy_date = c("2018/03/01", "2018/03/19", "2018/04/3", "2018/05/10", "2018/01/02", "2018/02/10", "2018/04/13"))
Grouping, mean Buy_date calculation and summarising:
library(dplyr)
df %>% group_by(Customer) %>% mutate(mean = mean(as.POSIXct(Buy_date))) %>% group_by(Customer, mean) %>% summarise()
Output:
# A tibble: 2 x 2
# Groups: Customer [?]
Customer mean
<dbl> <dttm>
1 1 2018-03-31 06:30:00
2 2 2018-02-17 15:40:00
Or, as @r2evans points out in his comment, for the consecutive days between Buy_dates:
df %>% group_by(Customer) %>% mutate(mean = mean(diff(as.POSIXct(Buy_date)))) %>% group_by(Customer, mean) %>% summarise()
Output:
# A tibble: 2 x 2
# Groups: Customer [?]
Customer mean
<dbl> <time>
1 1 23.3194444444444
2 2 50.4791666666667
I am not exactly sure of the desired output, but this is what I think you want.
library(dplyr)
library(zoo)
dat <- read.table(text =
"Customer Buy_date
1 2018/03/01
1 2018/03/19
1 2018/04/3
1 2018/05/10
2 2018/01/02
2 2018/02/10
2 2018/04/13", header = TRUE, stringsAsFactors = FALSE)
dat$Buy_date <- as.Date(dat$Buy_date)
dat %>% group_by(Customer) %>% mutate(diff_between = as.vector(diff(zoo(Buy_date), na.pad=TRUE)),
mean_days = mean(diff_between, na.rm = TRUE))
This produces:
Customer Buy_date diff_between mean_days
<int> <date> <dbl> <dbl>
1 1 2018-03-01 NA 23.3
2 1 2018-03-19 18 23.3
3 1 2018-04-03 15 23.3
4 1 2018-05-10 37 23.3
5 2 2018-01-02 NA 50.5
6 2 2018-02-10 39 50.5
7 2 2018-04-13 62 50.5
EDITED BASED ON USER COMMENTS:
Because you said that you have factors and not characters just convert them by doing the following:
dat$Buy_date <- as.Date(as.character(dat$Buy_date))
dat$Customer <- as.character(dat$Customer)
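If only the one-row-per-customer result (Customer, mean) is needed, a base R sketch along the following lines may be enough. It is an illustration rather than part of the answers above: it assumes the dates are given as year/month/day strings, parses them with as.Date so the gaps come out as whole days, and uses the made-up name mean_days for the result.

```r
# Example data from the question
dat <- data.frame(
  Customer = c(1, 1, 1, 1, 2, 2, 2),
  Buy_date = c("2018/03/01", "2018/03/19", "2018/04/3", "2018/05/10",
               "2018/01/02", "2018/02/10", "2018/04/13"),
  stringsAsFactors = FALSE
)

# Parse the strings as calendar dates (no time-of-day, so no daylight-saving effects)
dat$Buy_date <- as.Date(dat$Buy_date, format = "%Y/%m/%d")

# For each customer: sort the dates, take consecutive differences in days, average them
mean_days <- sapply(
  split(dat$Buy_date, dat$Customer),
  function(d) mean(as.numeric(diff(sort(d))))
)
mean_days
```

The result is a named numeric vector with one entry per customer; for the example data the gaps are 18, 15 and 37 days for customer 1 and 39 and 62 days for customer 2.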

Rolling sum over multiple columns in r

I am working on R with a dataset that looks like this:
test=data.frame("1991" = c(1,5,3), "1992" = c(4,3,3), "1993" = c(10,5,3), "1994" = c(1,1,1), "1995" = c(2,2,6))
test=plyr::rename(test, c("X1991"="1991", "X1992"="1992", "X1993"="1993", "X1994"="1994", "X1995"="1995"))
I want to create variables called Pre1991, Pre1992, Pre1993, ... that store the cumulative values up to and including that year, e.g.
Pre1991 = test$1991
Pre1992 = test$1991 + test$1992
Pre1993 = test$1991 + test$1992 + test$1993
and so on.
My real dataset has variables for the years 1900-2017, so I can't do this manually. I tried to write a for loop, but it didn't work.
for (i in 1900:2017){
x = paste0("Pre",i)
df[[x]] = rowSums(df[,(colnames(df)<=i)])
}
Can someone please help to review my code/ suggest other ways to do it? Thanks!
Edit 1:
Thanks so much! I'm also wondering whether I can use the cumsum function in the reverse direction. For example, if I am interested in what happened after a particular year:
Post1991 = test$1992 + test$1993 + test$1994 + test$1995 + ...
Post1992 = test$1993 + test$1994 + test$1995 + ...
Post1993 = test$1994 + test$1995 + ...
This is a little inefficient in that it is converting from a data.frame to a matrix and back, but ...
as.data.frame(t(apply(as.matrix(test), 1, cumsum)))
# 1991 1992 1993 1994 1995
# 1 1 5 15 16 18
# 2 5 8 13 14 16
# 3 3 6 9 10 16
If your data has other columns that are not year-based, such as
test$quux <- LETTERS[3:5]
test
# 1991 1992 1993 1994 1995 quux
# 1 1 4 10 1 2 C
# 2 5 3 5 1 2 D
# 3 3 3 3 1 6 E
then subset on both sides:
test[1:5] <- as.data.frame(t(apply(as.matrix(test[1:5]), 1, cumsum)))
test
# 1991 1992 1993 1994 1995 quux
# 1 1 5 15 16 18 C
# 2 5 8 13 14 16 D
# 3 3 6 9 10 16 E
EDIT
In reverse, just use repeated rev:
as.data.frame(t(apply(as.matrix(test), 1, function(a) rev(cumsum(rev(a)))-a)))
# 1991 1992 1993 1994 1995
# 1 17 13 3 2 0
# 2 11 8 3 2 0
# 3 13 10 7 6 0
Using the tidyverse, we can gather the data to long format, compute the cumulative sum, and then spread it back to wide format. For this to work, the rows need to be arranged by year within each id.
library(tidyverse)
test <- data.frame("1991" = c(1, 5, 3),
"1992" = c(4, 3, 3),
"1993" = c(10, 5, 3),
"1994" = c(1, 1, 1),
"1995" = c(2, 2, 6))
test <- plyr::rename(test, c("X1991" = "1991",
"X1992" = "1992",
"X1993" = "1993",
"X1994" = "1994",
"X1995" = "1995"))
Forwards
test %>%
mutate(id = 1:nrow(.)) %>% # adding an ID to identify groups
gather(year, value, -id) %>% # wide to long format
arrange(id, year) %>%
group_by(id) %>%
mutate(value = cumsum(value)) %>%
ungroup() %>%
spread(year, value) %>% # long to wide format
select(-id) %>%
setNames(paste0("pre", names(.))) # add prefix to columns
## A tibble: 3 x 5
# pre1991 pre1992 pre1993 pre1994 pre1995
# <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1. 5. 15. 16. 18.
# 2 5. 8. 13. 14. 16.
# 3 3. 6. 9. 10. 16.
Reverse direction
By your definition, this is not strictly the cumulative sum in reverse order: it is the reverse-order sum excluding the year itself, which is the cumulative sum of the lagged values.
test %>%
mutate(id = 1:nrow(.)) %>%
gather(year, value, -id) %>%
arrange(id, desc(year)) %>% # using desc() to reverse sorting
group_by(id) %>%
mutate(value = cumsum(lag(value, default = 0))) %>% # lag cumsum
ungroup() %>%
spread(year, value) %>%
select(-id) %>%
setNames(paste0("post", names(.)))
## A tibble: 3 x 5
# post1991 post1992 post1993 post1994 post1995
# <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 17. 13. 3. 2. 0.
# 2 11. 8. 3. 2. 0.
# 3 13. 10. 7. 6. 0.
We can use rowCumsums from matrixStats
library(matrixStats)
test[] <- rowCumsums(as.matrix(test))
test
# 1991 1992 1993 1994 1995
#1 1 5 15 16 18
#2 5 8 13 14 16
#3 3 6 9 10 16
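For completeness, the for loop from the question can also be made to work. The sketch below is one possible way, assuming all year columns are already in chronological order and no other columns are present; it selects the columns by name instead of comparing column names with a number. The variable years is illustrative.

```r
# Example data; check.names = FALSE keeps the column names as "1991", "1992", ...
test <- data.frame("1991" = c(1, 5, 3), "1992" = c(4, 3, 3),
                   "1993" = c(10, 5, 3), "1994" = c(1, 1, 1),
                   "1995" = c(2, 2, 6), check.names = FALSE)

# Names of the year columns, assumed to be sorted chronologically
years <- names(test)

for (i in seq_along(years)) {
  # Sum the first i year columns row by row and store them as Pre<year>
  test[[paste0("Pre", years[i])]] <- rowSums(test[years[seq_len(i)]])
}
test
```

Unlike the cumsum-based answers, this keeps the original year columns and appends the Pre columns next to them.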

keep only consecutive observations

As said in the title, I have a data.frame like below,
df<-data.frame('id'=c('1','1','1','1','1','1','1'),'time'=c('1998','2000','2001','2002','2003','2004','2007'))
df
id time
1 1 1998
2 1 2000
3 1 2001
4 1 2002
5 1 2003
6 1 2004
7 1 2007
There are other cases with shorter or longer time windows than this; the above is just for illustration.
I want to do two things with this data set. First, find all ids that have at least five consecutive observations; this can be done by following the solutions here. Second, keep only the observations that belong to a run of at least five consecutive years for the ids selected in the first step. The ideal result would be:
df
id time
1 1 2000
2 1 2001
3 1 2002
4 1 2003
5 1 2004
I could write a complex function using a for loop and the diff function, but that may be very time consuming, both in writing the function and in getting the result on a bigger data set with lots of ids. It also does not seem like idiomatic R, and I believe there should be a one- or two-line solution.
Does anyone know how to achieve this? Your time and knowledge would be deeply appreciated. Thanks in advance.
You can use dplyr to group by id and runs of consecutive time, and filter out groups with fewer than 5 entries, i.e.
#read data with stringsAsFactors = FALSE
df<-data.frame('id'=c('1','1','1','1','1','1','1'),
'time'=c('1998','2000','2001','2002','2003','2004','2007'),
stringsAsFactors = FALSE)
library(dplyr)
df %>%
mutate(time = as.integer(time)) %>%
group_by(id, grp = cumsum(c(1, diff(time) != 1))) %>%
filter(n() >= 5)
which gives
# A tibble: 5 x 3
# Groups: id, grp [1]
id time grp
<chr> <int> <dbl>
1 1 2000 2
2 1 2001 2
3 1 2002 2
4 1 2003 2
5 1 2004 2
Similar to @Sotos's answer, this solution instead uses seqle (from cgwtools) as the grouping variable:
library(dplyr)
library(cgwtools)
df %>%
mutate(time = as.numeric(time)) %>%
group_by(id, consec = rep(seqle(time)$length, seqle(time)$length)) %>%
filter(consec >= 5)
Result:
# A tibble: 5 x 3
# Groups: id, consec [1]
id time consec
<chr> <dbl> <int>
1 1 2000 5
2 1 2001 5
3 1 2002 5
4 1 2003 5
5 1 2004 5
To remove grouping variable:
df %>%
mutate(time = as.numeric(time)) %>%
group_by(id, consec = rep(seqle(time)$length, seqle(time)$length)) %>%
filter(consec >= 5) %>%
ungroup() %>%
select(-consec)
Result:
# A tibble: 5 x 2
id time
<chr> <dbl>
1 1 2000
2 1 2001
3 1 2002
4 1 2003
5 1 2004
Data:
df<-data.frame('id'=c('1','1','1','1','1','1','1'),
'time'=c('1998','2000','2001','2002','2003','2004','2007'),
stringsAsFactors = FALSE)
Try this on your data:
library(magrittr) # provides %>%
df[,] <- lapply(df, function(x) type.convert(as.character(x), as.is = TRUE))
IND1 <- (df$time - c(df$time[-1],df$time[length(df$time)-1])) %>% abs(.)
IND2 <- (df$time - c(df$time[2],df$time[-(length(df$time))])) %>% abs(.)
df <- df[IND1 %in% 1 | IND2 %in% 1,]
df[ave(df$time, df$id, FUN = length) >= 5, ]
A solution from dplyr, tidyr, and data.table.
library(dplyr)
library(tidyr)
library(data.table)
df2 <- df %>%
mutate(time = as.numeric(as.character(time))) %>%
arrange(id, time) %>%
right_join(tibble(time = full_seq(.$time, 1)), by = "time") %>%
arrange(time) %>% # restore time order so the missing years split the runs
mutate(RunID = rleid(id)) %>%
group_by(RunID) %>%
filter(n() >= 5, !is.na(id)) %>%
ungroup() %>%
select(-RunID)
df2
# A tibble: 5 x 2
id time
<fctr> <dbl>
1 1 2000
2 1 2001
3 1 2002
4 1 2003
5 1 2004
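The same idea can be sketched in base R without extra packages. This is an illustration under the assumption that time holds whole years and that a run is broken whenever the id changes or the gap to the previous row is not exactly 1; the names new_run, run and res are made up for the sketch.

```r
# Example data from the question, with time stored as numbers
df <- data.frame(id = rep("1", 7),
                 time = c(1998, 2000, 2001, 2002, 2003, 2004, 2007),
                 stringsAsFactors = FALSE)

# Sort so that consecutive years of the same id sit next to each other
df <- df[order(df$id, df$time), ]

# A new run starts at the first row, when the gap is not 1, or when the id changes
new_run <- c(TRUE, diff(df$time) != 1 | df$id[-1] != df$id[-nrow(df)])
run <- cumsum(new_run)

# Keep only rows that belong to a run of at least five consecutive years
res <- df[ave(run, run, FUN = length) >= 5, ]
res
```

Because the run id is rebuilt from the id and the gaps together, the same code should also handle several ids in one data frame.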
