I have a dataset of random, sometimes infrequent, events that I want to count as a sum per week. Because the events occur at irregular intervals, the examples I have tried so far do not apply.
The data is similar to this:
df_date <- data.frame( Name = c("Jim","Jim","Jim","Jim","Jim","Jim","Jim","Jim","Jim","Jim",
"Sue","Sue","Sue","Sue","Sue","Sue","Sue","Sue","Sue","Sue"),
Dates = c("2010-1-1", "2010-1-2", "2010-01-5","2010-01-17","2010-01-20",
"2010-01-29","2010-02-6","2010-02-9","2010-02-16","2010-02-28",
"2010-1-1", "2010-1-2", "2010-01-5","2010-01-17","2010-01-20",
"2010-01-29","2010-02-6","2010-02-9","2010-02-16","2010-02-28"),
Event = c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1) )
What I'm trying to do is create a new table that contains the sum of events per week in the calendar year.
In this case producing something like this:
Name Week Events
Jim 1 3
Sue 1 3
Jim 2 0
Sue x ... x
and so on...
Update, following the OP's request to handle multiple years:
We could use isoweek (also from lubridate) instead of week,
or
we could add the year as follows:
df_date %>%
as_tibble() %>%
mutate(Week = week(ymd(Dates))) %>%
mutate(Year = year(ymd(Dates))) %>%
count(Name, Year, Week)
We could use lubridate's week function after converting the character Dates to Date class with lubridate's ymd function.
Then we can use count, which is shorthand for group_by(Name, Week) %>% summarise(n = n()):
library(dplyr)
library(lubridate)
df_date %>%
as_tibble() %>%
mutate(Week = week(ymd(Dates))) %>%
count(Name, Week)
Name Week n
<chr> <dbl> <int>
1 Jim 1 3
2 Jim 3 2
3 Jim 5 1
4 Jim 6 2
5 Jim 7 1
6 Jim 9 1
7 Sue 1 3
8 Sue 3 2
9 Sue 5 1
10 Sue 6 2
11 Sue 7 1
12 Sue 9 1
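To see how the choice between week() and isoweek() plays out on this data, here is a small sketch using the first three dates from the question. lubridate's week() counts complete seven-day periods starting on January 1, whereas isoweek() follows ISO 8601, so dates in early January can belong to the last ISO week of the previous year. The isoyear() call is an addition that does not appear in the answer above; pairing it with isoweek() keeps such dates from being grouped under the wrong year.

```r
library(lubridate)

# First three dates from the question's sample data
d <- ymd(c("2010-1-1", "2010-1-2", "2010-01-5"))

week(d)     # 1 1 1: all three fall in the first 7-day block of 2010
isoweek(d)  # 53 53 1: Jan 1-2, 2010 belong to ISO week 53 of 2009
isoyear(d)  # 2009 2009 2010: the year to group by alongside isoweek()
```

With week(), Jim's three events on these dates land in week 1, as in the output above; with isoweek() two of them would be counted in week 53 of the previous ISO year.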
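Here is an approach that gets you each week of the year for each individual, with zeros when there are no events for that week for that individual. Weeks are numbered by lubridate::week(), which is not the ISO week; use isoweek() together with isoyear() if you need ISO weeks:
library(dplyr)
library(lubridate)
# All distinct Year/Week pairs between the first and last date
get_dates_df <- function(d) {
  data.frame(date = seq(min(d, na.rm = TRUE), max(d, na.rm = TRUE), by = 1)) %>%
    mutate(Year = year(date), Week = week(date)) %>%
    distinct(Year, Week)
}
df_date <- df_date %>% mutate(Dates = ymd(Dates))
left_join(
  # Cross join of names and weeks; in dplyr >= 1.1.0, cross_join() replaces by = character()
  full_join(distinct(df_date, Name), get_dates_df(df_date$Dates), by = character()),
  df_date %>%
    group_by(Name, Year = year(Dates), Week = week(Dates)) %>%
    summarize(Events = sum(Event), .groups = "drop"),
  by = c("Name", "Year", "Week")
) %>%
  mutate(Events = if_else(is.na(Events), 0, Events))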
Output:
Name Year Week Events
1 Jim 2010 1 3
2 Jim 2010 2 0
3 Jim 2010 3 2
4 Jim 2010 4 0
5 Jim 2010 5 1
6 Jim 2010 6 2
7 Jim 2010 7 1
8 Jim 2010 8 0
9 Jim 2010 9 1
10 Sue 2010 1 3
11 Sue 2010 2 0
12 Sue 2010 3 2
13 Sue 2010 4 0
14 Sue 2010 5 1
15 Sue 2010 6 2
16 Sue 2010 7 1
17 Sue 2010 8 0
18 Sue 2010 9 1
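The same zero-filling can also be sketched with tidyr::complete(), which adds the missing combinations in one step. This is an alternative to the join above rather than the answerer's code, and it assumes the data covers a single calendar year: full_seq(Week, 1) spans the overall minimum to maximum week number, so with several years it would fill every year over that same range.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Sample data from the question
dates <- c("2010-1-1", "2010-1-2", "2010-01-5", "2010-01-17", "2010-01-20",
           "2010-01-29", "2010-02-6", "2010-02-9", "2010-02-16", "2010-02-28")
df_date <- data.frame(Name  = rep(c("Jim", "Sue"), each = 10),
                      Dates = rep(dates, times = 2),
                      Event = 1)

weekly <- df_date %>%
  mutate(Dates = ymd(Dates), Year = year(Dates), Week = week(Dates)) %>%
  group_by(Name, Year, Week) %>%
  summarise(Events = sum(Event), .groups = "drop") %>%
  # Add the missing Name/Year/Week combinations and give them 0 events
  complete(Name, Year, Week = full_seq(Week, 1), fill = list(Events = 0))

weekly
```

On the sample data this should give 18 rows (2 names by 9 weeks), with 0 events for the weeks that had none.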
I am working with R.
I have a dataset that looks something like this:
id = c("john", "john", "john", "john","john", "james", "james", "james", "james", "james")
year = c(2010,2011, 2014, 2016,2017, 2013, 2016, 2017, 2018,2020)
var = c(1,1,1,1,1,1,1,1,1,1)
my_data = data.frame(id, year, var)
> my_data
id year var
1 john 2010 1
2 john 2011 1
3 john 2014 1
4 john 2016 1
5 john 2017 1
6 james 2013 1
7 james 2016 1
8 james 2017 1
9 james 2018 1
10 james 2020 1
As we can see, there are some missing years (i.e. non-consecutive years) in this dataset - for each ID, I am trying to add rows corresponding to these missing years and assign the "var" variable as "0" in these rows.
As an example, this would look something like this for the first ID:
id year var
1 john 2010 1
2 john 2011 1
3 john 2012 0
4 john 2013 0
5 john 2014 1
6 john 2015 0
7 john 2016 1
8 john 2017 1
I tried to do this with the following code:
# https://stackoverflow.com/questions/74365569/backfilling-rows-based-on-max-conditions-in-r
library(dplyr)
library(tidyr)
my_data %>%
group_by(id) %>%
complete(year = full_seq(year, period = 1)) %>%
fill(year, var, .direction = "downup") %>%
mutate(var= 0 ) %>%
ungroup
But this is not giving the desired result - as we can see, rows have been deleted and all values of "var" have been replaced with 0:
A tibble: 16 x 3
id year var
<chr> <dbl> <dbl>
1 james 2013 0
2 james 2014 0
3 james 2015 0
4 james 2016 0
5 james 2017 0
6 james 2018 0
7 james 2019 0
8 james 2020 0
Can someone please show me how to fix this problem?
Thanks!
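I would use the fill argument of complete(). It takes a named list of the values to use for the rows that complete() adds, so the existing values of var are left alone. In your attempt, mutate(var = 0) overwrote every value of var, and the fill() step is not needed.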
library(tidyverse)
my_data %>%
group_by(id) %>%
complete(year = full_seq(year, period = 1), fill = list(var = 0)) %>%
ungroup
Output
id year var
<chr> <dbl> <dbl>
1 james 2013 1
2 james 2014 0
3 james 2015 0
4 james 2016 1
5 james 2017 1
6 james 2018 1
7 james 2019 0
8 james 2020 1
9 john 2010 1
10 john 2011 1
11 john 2012 0
12 john 2013 0
13 john 2014 1
14 john 2015 0
15 john 2016 1
16 john 2017 1
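You can create a data frame with all years and ids, then do a full_join with the original data frame. Note that this uses the overall minimum and maximum year, so every id gets a row for each year from 2010 to 2020 rather than only for its own range.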
library(dplyr)
library(tidyr)
expand_grid(id = unique(my_data$id),year = min(my_data$year):max(my_data$year)) %>%
full_join(my_data) %>%
replace_na(replace = list(var = 0))
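If the grid should only cover each id's own range of years, as in the desired output in the question, the expand-and-join idea can be sketched in base R as follows. This is an illustration under that assumption, not code from either answer; the names per_id, grid and filled are made up for the example.

```r
# Sample data from the question
my_data <- data.frame(
  id   = rep(c("john", "james"), each = 5),
  year = c(2010, 2011, 2014, 2016, 2017, 2013, 2016, 2017, 2018, 2020),
  var  = 1
)

# One row per id and year, from that id's first to last observed year
per_id <- lapply(split(my_data, my_data$id), function(d) {
  data.frame(id = d$id[1], year = seq(min(d$year), max(d$year)))
})
grid <- do.call(rbind, per_id)

# Left join on id and year; years absent from my_data get var = NA, then 0
filled <- merge(grid, my_data, by = c("id", "year"), all.x = TRUE)
filled$var[is.na(filled$var)] <- 0
filled
```

This should return 16 rows, 8 per id, with var equal to 0 in the years that were missing from the original data.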
I have a dataframe like the following one:
day year value
1 2014 5
1 2015 16
1 2016 0
2 2014 3
2 2015 1
2 2016 4
and I want to calculate the average value by day for the three year period (2014, 2015, 2016). The following code works for this purpose:
data %>%
group_by(day) %>%
mutate(MEAN = mean(value))
and produces this output:
day year value MEAN
1 2014 5 7
1 2015 16 7
1 2016 0 7
2 2014 3 3
2 2015 1 3
2 2016 4 3
but I want to add the average values as new rows in the same dataframe as follows:
day year value
1 2014 5
1 2015 16
1 2016 0
2 2014 3
2 2015 1
2 2016 4
1 avg 7 <--
2 avg 3 <--
Any suggestions about how can I possibly do this? Thanks!
We can use summarise (instead of mutate, which adds a new column to the original dataset) to calculate the mean, and then bind the result to the original data with bind_rows. The tidyverse functions are strict about column types, so make sure the classes of the columns match before binding: here year has to be converted to character because the new rows hold the string 'avg'.
library(dplyr)
data %>%
group_by(day) %>%
summarise(year = 'avg', value = mean(value)) %>%
bind_rows(data %>%
mutate(year = as.character(year)), .)
# day year value
#1 1 2014 5.00
#2 1 2015 16.00
#3 1 2016 0.00
#4 2 2014 3.00
#5 2 2015 1.00
#6 2 2016 4.00
#7 1 avg 7.00
#8 2 avg 2.67
Another option is to split by the 'day' and then with add_row (from tibble) create a new row on each of the list elements
library(tibble)
library(purrr)
data %>%
mutate(year = as.character(year)) %>%
group_split(day) %>%
map_dfr(~ .x %>% add_row(day = first(.$day),
year = 'avg', value = mean(.$value)))
Here is a base R option using aggregate
rbind(df,cbind(aggregate(value~day,df,mean),year = "avg")[c(1,3,2)])
or a variation (by @thelatemail, from the comments)
rbind(df, aggregate(df["value"], cbind(df["day"], year="avg"), FUN=mean))
which gives
day year value
1 1 2014 5.000000
2 1 2015 16.000000
3 1 2016 0.000000
4 2 2014 3.000000
5 2 2015 1.000000
6 2 2016 4.000000
7 1 avg 7.000000
8 2 avg 2.666667
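For a version that can be run end to end, here is a base R sketch with the data frame reconstructed from the question (the answers above assume an existing data or df object). The step-by-step layout and the name avg are choices made for this illustration. Note that the mean for day 2 is 8/3, about 2.67, as in the answers' output, rather than the rounded 3 shown in the question.

```r
# Data reconstructed from the question
data <- data.frame(day   = c(1, 1, 1, 2, 2, 2),
                   year  = rep(2014:2016, times = 2),
                   value = c(5, 16, 0, 3, 1, 4))

# One row per day holding the mean of value
avg <- aggregate(value ~ day, data = data, FUN = mean)
avg$year <- "avg"

# Convert year to character explicitly so both pieces have the same column types
data$year <- as.character(data$year)

# Reorder avg's columns to match data, then stack the average rows underneath
out <- rbind(data, avg[names(data)])
out
```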
I am trying to do the following logic to create 'subtract' column.
I have years from 1986-2014 and around 100 firms.
year firm count sum_of_year subtract
1986 A 1 2 2
1986 B 1 2 4
1987 A 2 4 5
1987 C 1 4 2
1987 D 1 4 5
1988 C 3 5
1988 E 2 5
That is, if a firm i at t appears in t+1, then subtract its count at t+1 from the sum_of_year at t+1,
if a firm i does not appear in t+1, then just put sum_of_year at t+1 as shown in the sample.
I am having difficulties in creating this conditional code.
How can I do this in a generalized version?
Thank you for your help.
One way is to use dplyr with the help of tidyr::complete. We complete the missing combinations of year and firm and fill count with 0 for them. For each year, we subtract each firm's count from the total count for that year (a firm that is absent has count 0, so it gets the full total). Finally, for each firm, we take that value from the next year using lead.
library(dplyr)
df %>%
tidyr::complete(year, firm, fill = list(count = 0)) %>%
group_by(year) %>%
mutate(n = sum(count) - count) %>%
group_by(firm) %>%
mutate(subtract = lead(n)) %>%
filter(count != 0) %>%
select(-n)
# year firm count sum_of_year subtract
# <int> <fct> <dbl> <int> <dbl>
#1 1986 A 1 2 2
#2 1986 B 1 2 4
#3 1987 A 2 4 5
#4 1987 C 1 4 2
#5 1987 D 1 4 5
#6 1988 C 3 5 NA
#7 1988 E 2 5 NA
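The rule from the question can also be written out step by step in base R, which makes it easy to check against the sample. This is a sketch, not the answerer's method: it looks up the next year's total and the firm's own next-year count with match(), and the helper names totals, next_total, key and next_count are made up for the example. It assumes each firm appears at most once per year.

```r
# Sample data from the question (the yearly total is recomputed below)
df <- data.frame(year  = c(1986, 1986, 1987, 1987, 1987, 1988, 1988),
                 firm  = c("A", "B", "A", "C", "D", "C", "E"),
                 count = c(1, 1, 2, 1, 1, 3, 2))

# Total count per year, then the total of the following year for each row
totals <- aggregate(count ~ year, data = df, FUN = sum)
next_total <- totals$count[match(df$year + 1, totals$year)]

# The same firm's count in the following year, or 0 if it does not appear
key <- paste(df$year, df$firm)
next_count <- df$count[match(paste(df$year + 1, df$firm), key)]
next_count[is.na(next_count)] <- 0

# NA where there is no following year in the data
df$subtract <- next_total - next_count
df
```

For the sample this gives 2, 4, 5, 2, 5 for 1986 and 1987, and NA for 1988, which matches the table in the question.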
I'm trying, for each row, to calculate the difference with the closest previous row belonging to the same group which meets a certain criterion.
Suppose I have the following dataframe:
s <- read.table(text = "Visit_num Patient Day Admitted
1 1 2015/01/01 Yes
2 1 2015/01/10 No
3 1 2015/01/15 Yes
4 1 2015/02/10 No
5 1 2015/03/08 Yes
6 2 2015/01/01 Yes
7 2 2015/04/01 No
8 2 2015/04/10 No
9 3 2015/04/01 No
10 3 2015/04/10 No", header = T, sep = "")
For each Visit_num and for each Patient, I'd like to get the difference in days from the closest previous row for which the patient was admitted (i.e. Admitted is Yes). Note that the rows are ordered by Day, and the time unit for this example is days.
Here is what I wanted my dataframe to look like:
Visit_num Patient Day Admitted Diff_days
1 1 2015/01/01 Yes NA
2 1 2015/01/10 No 9
3 1 2015/01/15 Yes 14
4 1 2015/02/10 No 26
5 1 2015/03/08 Yes 52
6 2 2015/01/01 Yes NA
7 2 2015/04/01 No 90
8 2 2015/04/10 No 99
9 3 2015/04/01 No NA
10 3 2015/04/10 No NA
Any help is appreciated.
Here is an option with tidyverse. Convert 'Day' to Date class, arrange by 'Patient' and 'Day', then, grouped by 'Patient', take the difference between adjacent 'Day' values. Next, create a group 'grp' that starts a new run after each 'Yes' in 'Admitted', and take the cumulative sum of 'Diff_days' within each run.
library(tidyverse)
library(lubridate) # for ymd(); not attached by tidyverse before version 2.0.0
s %>%
  mutate(Day = ymd(Day)) %>%
  arrange(Patient, Day) %>%
  group_by(Patient) %>%
  mutate(Diff_days = c(NA, diff(Day))) %>%
  # .add = TRUE keeps the Patient grouping (the argument was called add before dplyr 1.0.0)
  group_by(grp = cumsum(lag(Admitted == "Yes", default = TRUE)), .add = TRUE) %>%
  mutate(Diff_days = cumsum(replace_na(Diff_days, 0))) %>%
  ungroup() %>%
  select(-grp) %>%
  mutate(Diff_days = na_if(Diff_days, 0))
# A tibble: 8 x 5
# Visit_num Patient Day Admitted Diff_days
# <int> <int> <date> <fct> <dbl>
#1 1 1 2015-01-01 Yes NA
#2 2 1 2015-01-10 No 9
#3 3 1 2015-01-15 Yes 14
#4 4 1 2015-02-10 No 26
#5 5 1 2015-03-08 Yes 52
#6 6 2 2015-01-01 Yes NA
#7 7 2 2015-04-01 No 90
#8 8 2 2015-04-10 No 99
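The printed output stops at visit 8, so it does not show patient 3, who is never admitted. As far as I can tell, lag(..., default = TRUE) treats the first visit of every patient as if an admission preceded it, so visit 10 would get 9 instead of the NA the question asks for. A different way to sketch the calculation, which is not the answerer's method, is to carry the date of the most recent earlier admission forward and subtract it; last_adm and res are names invented for this illustration.

```r
library(dplyr)
library(tidyr)

# Sample data from the question
s <- read.table(text = "Visit_num Patient Day Admitted
1 1 2015/01/01 Yes
2 1 2015/01/10 No
3 1 2015/01/15 Yes
4 1 2015/02/10 No
5 1 2015/03/08 Yes
6 2 2015/01/01 Yes
7 2 2015/04/01 No
8 2 2015/04/10 No
9 3 2015/04/01 No
10 3 2015/04/10 No", header = TRUE)

res <- s %>%
  mutate(Day = as.Date(as.character(Day), format = "%Y/%m/%d")) %>%
  arrange(Patient, Day) %>%
  group_by(Patient) %>%
  # Admission dates only, shifted down one row so a visit never matches itself
  mutate(last_adm = lag(replace(Day, Admitted != "Yes", NA))) %>%
  # Carry the most recent earlier admission date forward within each patient
  fill(last_adm, .direction = "down") %>%
  mutate(Diff_days = as.numeric(Day - last_adm)) %>%
  ungroup() %>%
  select(-last_adm)

res
```

On the sample data Diff_days comes out as NA, 9, 14, 26, 52, NA, 90, 99, NA, NA, which matches the table in the question, including the two NA values for patient 3.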
I have a data frame df as follows:
df
Code Time Country Type
1 n001 2000 France 1
2 n002 2001 Japan 5
3 n003 2003 USA 2
4 n004 2004 USA 2
5 n005 2004 Canada 1
6 n006 2005 Britain 2
7 n007 2005 USA 1
8 n008 2005 USA 2
9 n010 2005 USA 1
10 n011 2005 Canada 1
11 n012 2005 USA 2
12 n013 2005 USA 5
13 n014 2005 Canada 1
14 n015 2006 USA 2
15 n017 2006 Canada 1
16 n018 2006 Britain 1
17 n019 2006 Canada 1
18 n020 2006 USA 1
...
where Type is the type of news, and Time is the year when the news was published.
My aim is to count the number of each type of news each year.
I was thinking about a result like this:
...
$2005
Type: 1 Count: 4
Type: 2 Count: 3
Type: 5 Count: 1
$2006
Type: 1 Count: 4
...
I used the following code:
gp = group_by(df, Time)
summarise(gp, table(Time))
Error in summarise_impl(.data, dots) :
Evaluation error: unique() applies only to vectors.
Then I tried split( ), thinking it may be able to separate the dataframe by year so I could count the number of each type by year
split(df, 'Time')
$Time
Code Time Country Type
1 n001 2000 France 1
2 n002 2001 Japan 5
3 n003 2003 USA 2
4 n004 2004 USA 2
...
Everything is almost the same, apart from the "$Time" sign.
I was wondering what I did wrong, and how to fix it.
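We can split the Type column by Time and compute the frequencies within each year with table. Note that split() needs the grouping values themselves (df$Time), not the column name as a string: split(df, 'Time') treats 'Time' as a single group label, which is why everything ended up under $Time.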
lapply(split(df$Type, df$Time), table)
#$`2000`
#1
#1
#$`2001`
#5
#1
#$`2003`
#2
#1
#$`2004`
#1 2
#1 1
#$`2005`
#1 2 5
#4 3 1
#$`2006`
#1 2
#4 1
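The difference between the two split() calls can be checked on a tiny example. The data frame below is made up for illustration and only mimics the Time and Type columns of the question's df; the last line adds a two-way table as one more way to see the same counts, which is not part of the answer above.

```r
# Small made-up sample with the same column names as the question's df
df <- data.frame(Time = c(2004, 2004, 2005, 2005, 2005),
                 Type = c(2, 1, 2, 1, 1))

# A string is treated as one grouping value recycled over all rows
names(split(df, "Time"))     # "Time"

# The column itself gives one group per distinct year
by_year <- split(df, df$Time)
names(by_year)               # "2004" "2005"

# Counts of each Type within each year, as a list of tables
counts <- lapply(split(df$Type, df$Time), table)

# The same counts as a single Time-by-Type table
tab <- table(df$Time, df$Type)
tab
```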
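How about this? It counts the rows for each Time and Type, then spreads the types into columns, giving one row per year.
library(dplyr)
library(tidyr)
df %>%
  group_by(Time, Type) %>%
  count() %>%
  # spread() still works but is superseded by pivot_wider() in tidyr >= 1.0.0
  spread(Type, n)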
You could use something like this: split on Time, then group by Type and tally the result.
library(dplyr)
library(purrr)
df %>%
  split(.$Time) %>%
  map(~ group_by(., Type) %>% tally())
......
$`2004`
# A tibble: 2 x 2
Type n
<int> <int>
1 1 1
2 2 1
$`2005`
# A tibble: 3 x 2
Type n
<int> <int>
1 1 4
2 2 3
3 5 1
$`2006`
# A tibble: 2 x 2
......
Or use summarise instead of tally if you want a column called count instead of n:
df %>%
  split(.$Time) %>%
  map(~ group_by(., Type) %>% summarise(count = n()))