R & dplyr: using mutate() to get yearly totals - r

Data like so:
data <- read.table(text = "year level items
2014 a 12
2014 b 16
2014 c 7", header = TRUE)
Would like to run that through mutate() and I guess group_by so I have a year and a total, so a row that's just:
year items
2014 35
Feels like it should be 101-level simple, but I can't quite figure this one out.

library(dplyr)
out <- data %>%
  group_by(year) %>%
  summarize(items = sum(items, na.rm = TRUE))
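A self-contained version using the sample data from the question (note that `read.table()` needs `header = TRUE` to treat the first line as column names):

```r
library(dplyr)

# Sample data from the question
data <- read.table(text = "year level items
2014 a 12
2014 b 16
2014 c 7", header = TRUE)

out <- data %>%
  group_by(year) %>%
  summarize(items = sum(items, na.rm = TRUE))

out
# year items
# 2014    35
```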


Using indexing to perform mathematical operations on data frame in r

I'm struggling to perform basic indexing on a data frame to perform mathematical operations. I have a data frame containing all 50 US states with an entry for each month of the year, so there are 600 observations. I wish to find the difference between a value for the month of December minus the January value for each of the states. My data looks like this:
> head(df)
  state year month value
1    AL 2020    01   2.7
2    AK 2020    01   5.0
3    AZ 2020    01   4.8
4    AR 2020    01   3.7
5    CA 2020    01   4.2
7    CO 2020    01   2.7
For instance, AL has a value in Dec of 4.7 and Jan value of 2.7 so I'd like to return 2 for that state.
I have been trying to do this with the group_by and summarize functions, but can't figure out the indexing piece of it to grab values that correspond to a condition. I couldn't find a resource for performing these mathematical operations using indexing on a data frame, and would appreciate assistance as I have other transformations I'll be using.
With dplyr:
library(dplyr)
df %>%
group_by(state) %>%
summarize(year_change = value[month == "12"] - value[month == "01"])
This assumes that your data is as you describe: every state has exactly one value for every month. If you have missing rows, or multiple observations for a state in a given month, I would not expect this code to work.
Another approach, based on row order rather than on the month value, might look like this:
library(dplyr)
df %>%
  ## make sure things are in the right order
  arrange(state, month) %>%
  group_by(state) %>%
  summarize(year_change = last(value) - first(value))
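As a quick sanity check, both summaries give the same answer on a small made-up dataset (two states, January and December observations only; the values are invented for illustration):

```r
library(dplyr)

# Made-up toy data: two states, only the Jan and Dec observations
df <- data.frame(
  state = c("AL", "AL", "AK", "AK"),
  month = c("01", "12", "01", "12"),
  value = c(2.7, 4.7, 5.0, 6.5)
)

by_month <- df %>%
  group_by(state) %>%
  summarize(year_change = value[month == "12"] - value[month == "01"])

by_order <- df %>%
  arrange(state, month) %>%
  group_by(state) %>%
  summarize(year_change = last(value) - first(value))

# Both give AK = 1.5 and AL = 2
```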

How to choose multiple rows every nth rows in R

I am currently working on a daily time series data which looks like this
Date        streamflow
1985-10-01  24
1985-10-02  6
1985-10-03  12
1985-10-04  14
...         ...
2010-09-30  21
What I need to do is select the data from Oct 5 to Oct 24 of each year. I know slice() and seq() can select a row every nth row, but I don't know how to make that work for selecting a run of multiple rows each year. Any suggestion will be greatly appreciated. Thank you in advance!
Assuming your Date column is a valid Date class, use filter():
library(dplyr)
library(lubridate)
your_data %>%
  filter(
    month(Date) == 10 &
      day(Date) >= 5 &
      day(Date) <= 24
  )
If your data isn't Date class yet, throw in a mutate(Date = ymd(Date)) before the filter() step.
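A small self-contained check, using a made-up daily series that spans two Octobers:

```r
library(dplyr)
library(lubridate)

# Made-up daily data covering Oct 1985 through Oct 1986
your_data <- data.frame(
  Date = seq(ymd("1985-10-01"), ymd("1986-10-31"), by = "day")
)
your_data$streamflow <- seq_len(nrow(your_data))

oct_window <- your_data %>%
  filter(month(Date) == 10 & day(Date) >= 5 & day(Date) <= 24)

# Oct 5-24 is 20 days, and the series covers two Octobers
nrow(oct_window)  # 40
```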

Building User Activity Cohorts

Thank you for your help - I am trying to build cohorts.
And I do get what I am looking for with ...
cohort3 <- transactions %>%
  group_by(userId) %>%
  mutate(first_transaction = min(createDate)) %>%
  group_by(first_transaction, createDate) %>%
  summarize(clients = n())
BUT ... as you can see by the result, I get data back for every single day.
We had 7 users that transacted on 2017-01-03 the first time.
2 of these users transacted on 2017-01-04.
4 of these users transacted on 2017-01-05 and so forth.
This is great - but it's too granular.
How do I modify the above code to summarize by month or better quarter?
Like:
Jan-2017 - 25 users transacted the first time.
Feb-2017 - 12 users from that cohort transacted again ... and so on.
Even better.
Q1 2017 - 78 users transacted.
Q2 2017 - 35 users of that Q1 2017 cohort transacted. etc
Thank you.
The lubridate package includes a quarter() function for determining which quarter of the year a given date falls into. Something along these lines should do what you want:
library(dplyr)
library(lubridate)
cohort3 <-
  transactions %>%
  group_by(userId) %>%
  mutate(first_transaction = min(createDate),
         quarter = quarter(first_transaction, with_year = TRUE)) %>%
  group_by(quarter) %>%
  summarize(clients = n_distinct(userId))
Using n_distinct(userId) counts each user once per quarter; plain n() would count transactions instead.
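To see what quarter() produces on its own (with_year = TRUE attaches the year; recent lubridate releases also spell this type = "year.quarter"):

```r
library(lubridate)

# Plain quarter number for a date
quarter(as.Date("2017-01-03"))  # 1
quarter(as.Date("2017-05-15"))  # 2

# Quarter with the year attached, e.g. 2017.1 for Q1 2017
quarter(as.Date("2017-01-03"), with_year = TRUE)
```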

Find list of common levels over a time serie

I have been trying to do the following within the dplyr package but without success.
I need to find which levels of a certain factor column are present in every level of another factor column, in my case, it is a Year column. This could be an example dataset:
ID Year
1568 2013
1341 2013
1568 2014
1341 2014
1261 2014
1348 2015
1568 2015
1341 2015
So I would like a list of the ID names that are present in every year. In the above example would be:
"1568", "1341"
I have been trying with dplyr to first group_by the Year column and then summarise the data somehow, but without achieving it.
Thanks
Using dplyr, we can group_by ID and keep the groups that have the same number of unique Years as the complete data:
library(dplyr)
df %>%
  group_by(ID) %>%
  filter(n_distinct(Year) == n_distinct(.$Year)) %>%
  pull(ID) %>%
  unique()
#[1] 1568 1341
Here is a base R solution using intersect() + split():
comm <- Reduce(intersect, split(df$ID, df$Year))
such that
> comm
[1] 1568 1341
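Both solutions can be verified against the example data from the question:

```r
# Example data from the question
df <- data.frame(
  ID   = c(1568, 1341, 1568, 1341, 1261, 1348, 1568, 1341),
  Year = c(2013, 2013, 2014, 2014, 2014, 2015, 2015, 2015)
)

# Base R: intersect the per-year ID sets
comm <- Reduce(intersect, split(df$ID, df$Year))
comm
# [1] 1568 1341
```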

Calculations by Subgroup in a Column [duplicate]

This question already has answers here:
Calculate the mean by group
(9 answers)
Closed 5 years ago.
I have a dataset that looks approximately like this:
> dataSet
month detrend
1 Jan 315.71
2 Jan 317.45
3 Jan 317.5
4 Jan 317.1
5 Jan 315.71
6 Feb 317.45
7 Feb 313.5
8 Feb 317.1
9 Feb 314.37
10 Feb 315.41
11 March 316.44
12 March 315.73
13 March 318.73
14 March 315.55
15 March 312.64
.
.
.
How do I compute the average by month? E.g., I want something like
> by_month
month ave_detrend
1 Jan 315.71
2 Feb 317.45
3 March 317.5
What you need is a way to group your column of interest ("detrend") by month. There are ways to do this in vanilla R, but the most convenient is dplyr from the tidyverse.
I will adapt the example from that page (the names on the left are changed so that disp isn't overwritten before sd() uses it):
mtcars %>%
  group_by(cyl) %>%
  summarise(mean_disp = mean(disp), sd_disp = sd(disp))
In your case, that would be:
by_month <- dataSet %>%
  group_by(month) %>%
  summarize(avg = mean(detrend))
This new "tidyverse" style looks quite different, and you seem quite new, so I'll explain what's happening (sorry if this is overly obvious):
First, we are grabbing the dataframe, which I'm calling dataSet.
Then we are piping that dataframe into the next function, group_by. Piping means taking the result of the previous step (here, just the dataframe dataSet) and supplying it as the first argument of the next function, so group_by receives the dataframe as its first argument.
Then the result of the group_by is piped to summarize (or summarise, if you're from down under, as the author is). On its own, summarize computes over the entire column, but group_by has partitioned the rows, and that is the key: summarize applies the function (mean, in this case) separately to each group. All of the Jan values are averaged together, then all of the Feb values, and so on, giving one mean per month.
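The pipe is only a rewriting of function calls; the piped chain and the nested version produce the same result (a tiny made-up dataSet shown for concreteness):

```r
library(dplyr)

# Tiny made-up dataset in the same shape as the question's
dataSet <- data.frame(
  month = c("Jan", "Jan", "Feb"),
  detrend = c(315.71, 317.45, 313.5)
)

piped <- dataSet %>%
  group_by(month) %>%
  summarize(avg = mean(detrend))

nested <- summarize(group_by(dataSet, month), avg = mean(detrend))

all.equal(piped, nested)  # TRUE
```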
HTH!!
R has a built-in mean function: mean(x, trim = 0, na.rm = FALSE, ...)
I would do something like this, one month at a time:
january <- dataSet[dataSet[, "month"] == "Jan", ]
januaryVector <- january[, "detrend"]
januaryAVG <- mean(januaryVector)
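Filtering month by month gets repetitive; base R's aggregate() handles every month in one call (shown on a small made-up dataSet):

```r
# Small made-up dataset in the same shape as the question's
dataSet <- data.frame(
  month   = c("Jan", "Jan", "Feb", "Feb"),
  detrend = c(315.71, 317.45, 313.5, 317.1)
)

# One mean of detrend per month, all at once
by_month <- aggregate(detrend ~ month, data = dataSet, FUN = mean)
by_month
#   month detrend
# 1   Feb  315.30
# 2   Jan  316.58
```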
