My goal is simply to count the number of records in each hour of each day. I thought a simple solution could be found with the dplyr or data.table packages.
My data set is extremely simple:
> head(test)
id date hour
1 14869663 2018-01-24 17
2 14869664 2018-01-24 17
3 14869665 2018-01-24 17
4 14869666 2018-01-24 17
5 14869667 2018-01-24 17
6 14869668 2018-01-24 17
I only need to group by two variables (date and hour) and count; the id doesn't matter. However, neither of these dplyr approaches produces the desired result: the output is a data frame the same length as the input data, which includes millions of records. What am I doing wrong here?
test %>% group_by(date, hour) %>% mutate(count = n())
test %>% add_count(date, hour)
The desired output would look something like this:
> head(output)
n_records date hour
1 700 2018-01-24 0
2 750 2018-01-24 1
3 730 2018-01-24 2
4 700 2018-01-24 3
5 721 2018-01-24 4
6 753 2018-01-24 5
and so on.
Any suggestions?
This seems to do the trick:
library(dplyr)
starwars %>%
  group_by(gender, species) %>%
  count()
It appears (h/t to Frank) that the count function can take the grouping fields directly:
starwars %>% count(gender, species)
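Applied to the asker's test data, the same pattern would be (a sketch; the name = and .groups = arguments need a reasonably recent dplyr):
library(dplyr)

# count() collapses to one row per date/hour group
test %>% count(date, hour, name = "n_records")

# equivalent explicit version
test %>%
  group_by(date, hour) %>%
  summarise(n_records = n(), .groups = "drop")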
Using data.table (test must be a data.table first, e.g. via setDT(test)):
test[, .N, by=.(date, hour)]
Base R:
aggregate(name ~ gender + species, data = starwars, length)
If we want to treat NAs as a group:
species1 <- factor(starwars$species, exclude = "")
gender1 <- factor(starwars$gender, exclude = "")
aggregate(name ~ gender1 + species1, data = starwars, length)
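For the asker's test data, the same base pattern would be (a sketch; it counts the non-missing id values per date/hour group):
aggregate(id ~ date + hour, data = test, FUN = length)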
Related
I am trying to split my data in two based on a gap in the dates. The problem is that in the real data the duration of the observations is not constant. I start by assigning lactation = 1 to every row, and then try to make everything after the long gap become 2.
What I am trying to do:
Identify the gap in days; if the gap is longer than 20 days, start counting from 1 again using group_by() and row_number().
The problem is that lag() does not carry the new value forward after the condition is met.
Code:
library(dplyr)
library(lubridate)
# simulating the data
name <- "cow1"
milk <- rnorm(500, 15, 6)
date1 <- seq(ymd('2012-01-01'), ymd('2012-09-06'), by = 'days') %>% as_tibble()
date2 <- seq(ymd('2013-01-01'), ymd('2013-09-07'), by = 'days') %>% as_tibble()
date <- bind_rows(date1, date2) %>% rename("day" = value)
cow1 <- milk %>% as_tibble() %>% rename("Yield" = value) %>% mutate(cowid = name, day = date$day)

cow1.1 <- cow1 %>%
  mutate(lactation = 1) %>%
  mutate(gap = day - lag(day, default = day[1])) %>%
  mutate(lactation = ifelse(gap > 20, lag(lactation) + 1, lag(lactation))) %>%
  group_by(lactation) %>%
  mutate(dim = row_number())
Sample result:
Row Yield cowid day lactation gap
250 3.1429436 cow1 2012-09-06 1 1 days
251 10.1427923 cow1 2013-01-01 2 117 days
252 19.8654469 cow1 2013-01-02 1 1 days
Desired result:
Row Yield cowid day lactation gap
250 3.1429436 cow1 2012-09-06 1 1 days
251 10.1427923 cow1 2013-01-01 2 117 days
252 19.8654469 cow1 2013-01-02 2 1 days
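One way to get the carried-forward behaviour (a sketch using the simulated cow1 data above, not code from the post): replace the lag()-based assignment with cumsum() on the gap condition, so every row after a >20-day gap moves to the next lactation and stays there:
library(dplyr)

cow1.2 <- cow1 %>%
  mutate(gap = day - lag(day, default = day[1]),
         # cumsum() counts how many long gaps have occurred so far,
         # so the lactation number persists after each gap
         lactation = cumsum(gap > 20) + 1) %>%
  group_by(lactation) %>%
  mutate(dim = row_number()) %>%
  ungroup()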
I have two data frames containing row entries with respective dates. Data frame 1 contains observations collected from 2010 to 2017.
dates A
2010-01-01 21
2010-01-02 27
2010-01-03 34
...
2017-12-29 22
2017-12-30 32
2017-12-31 25
Data frame 2 contains observations collected from 2015 to 2020.
dates A
2015-01-01 20
2015-01-02 29
2015-01-03 34
...
2020-12-29 22
2020-12-30 27
2020-12-31 32
Both data frames have missing observations for some days. I wish to combine them to fill in the missing data and obtain a complete time series up to 2020 without any repeated entries, like the following data frame:
dates A
2010-01-01 21
2010-01-02 27
2010-01-03 34
...
2020-12-29 22
2020-12-30 27
2020-12-31 32
Using merge(df1, df2, by = 'dates') or full_join(df1, df2, by = 'dates') creates duplicate entries or two columns A.x and A.y, which is not what I want.
Try the code below:
# stack the two data frames, drop exact duplicate rows, then sort by date
dfout <- unique(rbind(df1, df2))
dfout <- dfout[order(dfout$dates), ]
Combine df1 and df2; where a date appears in both data frames, take the mean of the A values, then use complete() to fill in the missing dates.
library(dplyr)
library(tidyr)
df1 %>%
  bind_rows(df2) %>%
  mutate(dates = as.Date(dates)) %>%
  group_by(dates) %>%
  summarise(A = mean(A)) %>%
  complete(dates = seq(min(dates), max(dates), by = 'day'))
If your df is really just two columns, you should be able to bind_rows, group_by, and distinct to remove duplicates.
library(dplyr)
df <- bind_rows(df1, df2) %>%
  group_by(dates, A) %>%
  distinct(dates)
Edit: This will not work if you have data that doesn't agree between the dataframes on a single date. If you have two records for 1/1/15 and they have different A values, those will both be retained.
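If the two frames can disagree on a date and you would rather keep one frame's value than average them, a sketch (not from the thread; it assumes df1 should win on overlapping dates):
library(dplyr)

# distinct() keeps the first occurrence, so binding df1 first means its row
# wins whenever a date appears in both data frames
df_combined <- bind_rows(df1, df2) %>%
  distinct(dates, .keep_all = TRUE) %>%
  arrange(dates)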
I have a dataframe in the following format:
Contract_Begin Contract_End FP
2020-01-01 2020-01-31 5
2020-01-01 2020-03-31 6
If Contract_End - Contract_Begin spans more than 1 month, I want to insert the additional months as rows below. Here is the desired output:
Contract_Begin Contract_End FP
2020-01-01     2020-01-31    5
2020-01-01                   6
2020-02-01                   6
2020-03-01                   6
I'm trying to accomplish this in R as part of data pre-processing. Any help is greatly appreciated.
We can use map2() to build the sequence of dates from 'Contract_Begin' to 'Contract_End', then unnest the list column created by map2() to expand the rows:
library(dplyr)
library(tidyr)
library(purrr)
df1 %>%
  mutate_at(1:2, as.Date) %>%
  mutate(Contract_Begin = map2(Contract_Begin, Contract_End, seq, by = "1 month")) %>%
  unnest(c(Contract_Begin))
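A quick check on the example table (this df1 construction is my own, since the question does not show how the data frame is built):
# hypothetical construction of the input shown above
df1 <- data.frame(
  Contract_Begin = c("2020-01-01", "2020-01-01"),
  Contract_End   = c("2020-01-31", "2020-03-31"),
  FP = c(5, 6)
)
Running the pipeline above on this df1 leaves the one-month contract as a single row, while the three-month contract expands to Contract_Begin values 2020-01-01, 2020-02-01 and 2020-03-01, each still carrying Contract_End = 2020-03-31 and FP = 6.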
Given a column of dates, this will count the number of records in each month:
library(dplyr)
library(lubridate)
samp <- tbl_df(seq.Date(as.Date("2017-01-01"), as.Date("2017-12-01"), by="day"))
freq <- samp %>%
  filter(!is.na(value)) %>%
  transmute(month = floor_date(value, "month")) %>%
  group_by(month) %>%
  summarise(adds = n())
freq
# A tibble: 12 x 2
month adds
<date> <int>
1 2017-01-01 31
2 2017-02-01 28
3 2017-03-01 31
4 2017-04-01 30
5 2017-05-01 31
6 2017-06-01 30
7 2017-07-01 31
8 2017-08-01 31
9 2017-09-01 30
10 2017-10-01 31
11 2017-11-01 30
12 2017-12-01 1
I would like to convert this to a function, so that I can perform the operation on a number of variables. Have read the vignette on dplyr programming, but continue to have issues.
My attempt:
library(rlang)
count_x_month <- function(df, var, name){
  var <- enquo(var)
  name <- enquo(name)
  df %>%
    filter(!is.na(!!var)) %>%
    transmute(month := floor_date(!!var, "month")) %>%
    group_by(month) %>%
    summarise(!!name := n())
}
freq2 <- samp %>% count_x_month(value, out)
Error message:
Error: invalid argument type
Making this version of the function work will be a big help. More broadly, other ways to achieve the objective would be welcome.
One way to state the problem; given a dataframe of customers and first purchase dates, count the number of customers purchasing for the first time in each month.
Update: The selected answer works in dplyr 0.7.4, but the RStudio environment I have access to has dplyr 0.5.0. What modifications are required to 'backport' this function?
You forgot to quo_name() it:
library(rlang)
count_x_month <- function(df, var, name){
  var <- enquo(var)
  name <- enquo(name)
  name <- quo_name(name)
  df %>%
    filter(!is.na(!!var)) %>%
    transmute(month := floor_date(!!var, "month")) %>%
    group_by(month) %>%
    summarise(!!name := n())
}
freq2 <- samp %>% count_x_month(value, out)
# A tibble: 12 x 2
month out
<date> <int>
1 2017-01-01 31
2 2017-02-01 28
3 2017-03-01 31
4 2017-04-01 30
5 2017-05-01 31
6 2017-06-01 30
7 2017-07-01 31
8 2017-08-01 31
9 2017-09-01 30
10 2017-10-01 31
11 2017-11-01 30
12 2017-12-01 1
See "Different input and output variable" section of "Programming with dplyr":
We create the new names by pasting together strings, so we need quo_name() to convert the input expression to a string.
The error is caused by summarise(!!name := n()) and is solved by replacing the line name <- enquo(name) in the function with:
name <- substitute(name)
The reason, as far as I understand it, is that a quosure is not just a name: it also carries the environment it came from. That makes sense when specifying column names in functions, because the function must know from which data frame (the environment, in this case) a column comes in order to replace the name with its values.
However, name is supposed to be a new name supplied by the user, so there is nothing to replace it with. I suspect that with name <- enquo(name), R tries to replace !!name with values instead of just inserting the new name, and therefore complains that there is no valid name on the LHS.
I'm not sure, though, whether substitute() is the idiomatic "programming with dplyr" way. Comments are welcome.
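For reference, a sketch of the full substitute()-based variant described above (the same function as in the question, with only the name line changed):
library(rlang)
library(dplyr)
library(lubridate)

count_x_month <- function(df, var, name){
  var <- enquo(var)
  name <- substitute(name)  # a bare symbol rather than a quosure
  df %>%
    filter(!is.na(!!var)) %>%
    transmute(month = floor_date(!!var, "month")) %>%
    group_by(month) %>%
    summarise(!!name := n())
}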
Create a dataframe showing customer IDs and first purchase dates:
dates <- seq.Date(as.Date("2017-01-01"), as.Date("2017-12-01"), by="day")
dates_rep <- c(dates,dates,dates)
cust_ids <- paste('id_', floor(runif(length(dates_rep), min=0, max=100000)))
cust_frame <- data.frame(ID=cust_ids, FP_DATE=dates_rep)
head(cust_frame)
Use the plyr package to aggregate by FP_DATE:
library(plyr)
count(cust_frame, c('FP_DATE'))
Therefore, given a dataframe of customers and first purchase dates, we get a count of the number of customers for each first purchase date.
You can extend this to aggregate across any number of features in your dataset:
count(cust_frame, c('FP_DATE', 'feature_b', 'feature_c', 'feature_d', 'feature_e'))
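To get the month-level counts the original question asked for (a sketch; FP_MONTH is a column name introduced here), floor each date to its month before counting:
library(plyr)
library(lubridate)

# floor each first-purchase date to the first of its month, then count per month
cust_frame$FP_MONTH <- floor_date(cust_frame$FP_DATE, "month")
count(cust_frame, c('FP_MONTH'))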
The end goal is to visualize the amount of a medication taken per day across a large sample of individuals. I'm trying to reshape my data to make a stacked area chart (or something similar).
In a more general term; I have my data structured as below:
id med start_date end_date
1 drug_a 2010-08-24 2011-03-03
2 drug_a 2011-06-07 2011-08-12
3 drug_b 2010-03-26 2010-10-31
4 drug_b 2012-08-14 2013-01-31
5 drug_c 2012-03-01 2012-06-20
5 drug_a 2012-04-01 2012-06-14
I think I need to create a data frame with one row per date, and a column for each drug counting the number of patients (id) taking it on that day. For example, if someone is taking drug_a from 2010-01-01 to 2010-01-20, each of those drug-days should count.
Something like:
date drug_a drug_b drug_c
2010-01-01 5 0 10
2010-01-02 10 2 8
I'm comfortable with dplyr and tidyr, but unsure how to use spread() with dates and durations.
I'd expand the data out to one row per date using do():
library(dplyr)
library(tidyr)
library(zoo)
df %>%
  group_by(id, med) %>%
  # one row for every day the prescription covers; origin is needed because
  # the : operator drops the Date class
  do(with(., data_frame(date = (start_date:end_date) %>% as.Date(origin = "1970-01-01")))) %>%
  group_by(date, med) %>%
  summarize(frequency = n()) %>%
  spread(med, frequency)
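A quick check (the df construction below is my own reading of the table in the question; note spread()'s fill = 0 argument, which turns the NAs on drug-free days into the zeros shown in the desired output):
library(dplyr)
library(tidyr)

# hypothetical reconstruction of the example data shown in the question
df <- data.frame(
  id  = c(1, 2, 3, 4, 5, 5),
  med = c("drug_a", "drug_a", "drug_b", "drug_b", "drug_c", "drug_a"),
  start_date = as.Date(c("2010-08-24", "2011-06-07", "2010-03-26",
                         "2012-08-14", "2012-03-01", "2012-04-01")),
  end_date   = as.Date(c("2011-03-03", "2011-08-12", "2010-10-31",
                         "2013-01-31", "2012-06-20", "2012-06-14"))
)

# same pipeline as above, with fill = 0 added to spread();
# tibble() here is the current name for data_frame()
df %>%
  group_by(id, med) %>%
  do(with(., tibble(date = (start_date:end_date) %>% as.Date(origin = "1970-01-01")))) %>%
  group_by(date, med) %>%
  summarize(frequency = n()) %>%
  spread(med, frequency, fill = 0)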