Tidy a column in a data frame by using mean function [duplicate] - r

This question already has answers here:
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Closed 3 years ago.
I have the following unemployment data per year and quarter. My data frame runs up to 2018, but I will use only two years as an example.
Year Unemployement
1997Q3 1914
1997Q4 1697
1998Q1 1702
1998Q2 1645
1998Q3 1742
1998Q4 1605
What code can I use to tidy the Year column and obtain the data below, where the unemployment number is the mean of all values for each year: 1997 and 1998 (plus the other years in my data frame)? In the final version, I would like only one Unemployement value per year, which should be the average of all quarters in that year.
Year Unemployement
1997 1805.50
1998 1673.50
Thank you!

## Data entry
library(tidyverse)
df <- tribble(
  ~Year, ~Quarter, ~Unemployement,
  1997, "Q3", 1914,
  1997, "Q4", 1697,
  1998, "Q1", 1702,
  1998, "Q2", 1645,
  1998, "Q3", 1742,
  1998, "Q4", 1605
)
## Solution
df %>%
  group_by(Year) %>%
  summarise(mean_year = mean(Unemployement))
# A tibble: 2 x 2
Year mean_year
<dbl> <dbl>
1 1997 1806.
2 1998 1674.
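## 2nd version (first separate the Year column)
## Here df is assumed to hold the original combined values such as "1997Q3" in Year.
## sep = 4 splits after the fourth character; the default separator would not split "1997Q3".
df %>%
  separate(Year, c("Year", "Quarter"), sep = 4) %>%
  group_by(Year) %>%
  summarise(mean_year = mean(Unemployement))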
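For comparison, here is a minimal base R sketch that starts from the question's original layout, where year and quarter share one string. The data frame name `raw` and the result name `yearly` are made up for this illustration; it assumes every value starts with a four-digit year.

```r
# Raw data in the question's original layout: year and quarter in one string.
raw <- data.frame(
  Year = c("1997Q3", "1997Q4", "1998Q1", "1998Q2", "1998Q3", "1998Q4"),
  Unemployement = c(1914, 1697, 1702, 1645, 1742, 1605)
)

# The first four characters are the calendar year.
raw$Year <- as.integer(substr(raw$Year, 1, 4))

# One row per year, holding the mean of that year's quarters.
yearly <- aggregate(Unemployement ~ Year, data = raw, FUN = mean)
print(yearly)
```

For the six sample rows this gives 1805.5 for 1997 and 1673.5 for 1998, the values asked for in the question.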

Related

Calculating which quarters occur in a timeframe

My dataset has monthly reporting which needs to be summed to return both the quarterly value and the 12 month rolling total. I have successfully created a column specifying which quarter each row is from by using df$Quarter <- quarter(df$Month, fiscal_start = 4, with_year = T). This returns values such as 2022.1, which I then use in my group_by to sum all values in that quarter. However, I now need to create a row for each area that returns the sum of the latest 4 quarters as of the time I update the dataset, which will be done quarterly.
If this were my data, I would want it to end up something like the second table:
|Area|Quarter|Measure_1|
|----|-------|---------|
|Area_a|2022.1|5|
|Area_a|2021.4|1|
|Area_a|2021.3|2|
|Area_a|2021.2|6|
|Area_b|2022.1|9|
|Area_b|2021.4|7|
|Area_b|2021.3|2|
|Area_b|2021.2|1|
It doesn't need to be exactly like this but this is the rough idea of what I want to happen
|Area|Quarter|Measure_1|Timeframe|
|----|-------|---------|---------|
|Area_a|2022.1|5|Quarterly|
|Area_a|2021.4|1|Quarterly|
|Area_a|2021.3|2|Quarterly|
|Area_a|2021.2|6|Quarterly|
|Area_a|2022.1|14|12 month rolling|
|Area_b|2022.1|9|Quarterly|
|Area_b|2021.4|7|Quarterly|
|Area_b|2021.3|2|Quarterly|
|Area_b|2021.2|1|Quarterly|
|Area_b|2022.1|19|12 month rolling|
The following code produces the required results from your sample data. If your real data covers multiple years for each Area, you would have to calculate a year variable and include it along with Area in the group_by().
library(dplyr)
want <- df %>%
  group_by(Area) %>%
  # use summarise to calculate totals
  # Quarter variable will be used to sort the output, new_Quarter will ensure
  # that the total row has the maximum Quarter value for that area
  summarise(new_Quarter = max(Quarter), Quarter = min(Quarter) - 0.05, Measure_1 = sum(Measure_1)) %>%
  bind_rows(df) %>% # combine the totals with the original data
  arrange(Area, -Quarter) %>% # sort so that total rows come after the quarters
  mutate(Quarter = if_else(is.na(new_Quarter), Quarter, new_Quarter), # assign maximum Quarter to total row
         Timeframe = if_else(is.na(new_Quarter), 'Quarterly', '12 month rolling')) %>% # add label
  select(-new_Quarter) # remove temporary variable

Extract one year from a period of time involving two years in R [duplicate]

This question already has answers here:
Assigning Dates to Fiscal Year
(4 answers)
Closed 2 years ago.
I need to extract the year corresponding to a time period and put it into a new column in a data frame. The tricky part is that the years I need to extract are not calendar years. My 'standardized' years start on the 1st of July of a given calendar year and end on the 30th of June of the following calendar year. So, if an event occurred at any time within this period, its standardized year is the first of the two calendar years involved. For example, if events occurred on the 1st of July 2019, the 25th of December 2019, and the 30th of June 2020, the 'standardized' year for all three events is 2019 (i.e. the first year of the period between the 1st of July 2019 and the 30th of June 2020). How can I extract such standardized years in R and assign them to a new column in the data frame?
My data file is very large, but as a simplified example, here are some events happening on specific dates:
dat <- as.Date(c("2-Feb-18", "24-May-10", "30-Dec-19","1-Jul-20"),"%d-%b-%y")
dat <- as.data.frame(dat)
names(dat)[1] <- "events"
dat
events
1 2018-02-02
2 2010-05-24
3 2019-12-30
4 2020-07-01
In this case the column I want to create 'standardized_year' should look like this
events standardized_year
1 2018-02-02 2017
2 2010-05-24 2009
3 2019-12-30 2019
4 2020-07-01 2020
In the first row, the standardized year is 2017 because 2-Feb-18 falls between the 1st of July 2017 and the 30th of June 2018, so the first year of that period is extracted. The same rule applies to all other values.
Is there a way to do this automatically in R for a large number of events in a data frame?
Any help would be much appreciated. Thanks
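You could extract the calendar year from the first four characters of the date, then check whether the date is earlier than every day of the standardized year that starts in that calendar year. If it is, the date belongs to the previous standardized year. The days of the standardized year are generated as a sequence with ISOdate.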
standardized_year <- sapply(dat$events, function(x) {
x <- as.POSIXct(x)
y <- as.numeric(substr(x, 1, 4))
ifelse(all(x < seq(ISOdate(y, 7, 1, 0), ISOdate(y + 1, 6, 30, 0), "day")), y - 1, y)
})
dat <- cbind(dat, standardized_year)
dat
# events standardized_year
# 1 2018-02-02 2017
# 2 2010-05-24 2009
# 3 2019-12-30 2019
# 4 2020-07-01 2020
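I guess you can do something like this:
library(dplyr)
library(lubridate)
# events is already a Date column in dat; if it were still text such as
# "2-Feb-18", convert it first with events = dmy(events)
dat <- dat %>%
  mutate(standardized_year = ifelse(month(events) >= 7, year(events), year(events) - 1))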
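The same month test can also be sketched in base R without extra packages. This is only an illustration: it rebuilds the sample dates from ISO strings (to avoid locale-dependent month abbreviations), and the helper names `lt`, `cal_year`, and `cal_month` are invented here.

```r
# Sample events from the question, written as ISO dates.
dat <- data.frame(
  events = as.Date(c("2018-02-02", "2010-05-24", "2019-12-30", "2020-07-01"))
)

lt <- as.POSIXlt(dat$events)
cal_year  <- lt$year + 1900  # POSIXlt stores years as an offset from 1900
cal_month <- lt$mon + 1      # POSIXlt months run from 0 to 11

# Dates before July belong to the standardized year that started the previous July.
# The logical (cal_month < 7) is coerced to 1 or 0 in the subtraction.
dat$standardized_year <- cal_year - (cal_month < 7)
print(dat)
```

The whole calculation is vectorised, so it should scale to a large number of events without looping over rows.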

Calculations by Subgroup in a Column [duplicate]

This question already has answers here:
Calculate the mean by group
(9 answers)
Closed 5 years ago.
I have a dataset that looks approximately like this:
> dataSet
month detrend
1 Jan 315.71
2 Jan 317.45
3 Jan 317.5
4 Jan 317.1
5 Jan 315.71
6 Feb 317.45
7 Feb 313.5
8 Feb 317.1
9 Feb 314.37
10 Feb 315.41
11 March 316.44
12 March 315.73
13 March 318.73
14 March 315.55
15 March 312.64
.
.
.
How do I compute the average by month? E.g., I want something like
> by_month
month ave_detrend
1 Jan 315.71
2 Feb 317.45
3 March 317.5
What you need is a way to group your column of interest (the "detrend") by the month. There are ways to do this within "vanilla R", but a convenient way is to use tidyverse's dplyr.
Here is an example adapted from the dplyr documentation (the output columns get new names so that sd() still sees the original disp values rather than the freshly computed mean):
mtcars %>%
  group_by(cyl) %>%
  summarise(mean_disp = mean(disp), sd_disp = sd(disp))
In your case, that would be:
by_month <- dataSet %>%
group_by(month) %>%
summarize(avg = mean(detrend))
This "tidyverse" style looks quite different from base R, and you seem quite new, so I'll explain what's happening (sorry if this is overly obvious):
First, we take the data frame, which I'm calling dataSet.
Then we pipe that data frame to the next function, group_by. Piping means taking the result of the previous command (which in this case is just the data frame dataSet) and using it as the first argument of the next function. group_by takes a data frame as its first argument.
Then the result of group_by is piped to the next function, summarize (or summarise if you're from down under, as the author is). On its own, summarize calculates over all the data in the column; group_by, however, creates partitions in that column. So we now get the mean calculated for each partition we've made, that is, for each month.
This is the key: group_by creates "flags" so that summarize calculates the function (mean, in this case) separately on each group. So, for instance, all of the Jan values are grouped together and the mean is calculated only on them. Then the mean is calculated for all of the Feb values, and so on.
HTH!!
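To make the grouping idea concrete, here is a rough base R sketch of the same split, apply, combine steps done by hand. It is not how dplyr works internally; it only mirrors the logic. The data frame rebuilds the fifteen sample rows from the question, and the names `groups` and `means` are invented for this illustration.

```r
# The sample rows shown in the question.
dataSet <- data.frame(
  month = rep(c("Jan", "Feb", "March"), each = 5),
  detrend = c(315.71, 317.45, 317.5, 317.1, 315.71,
              317.45, 313.5, 317.1, 314.37, 315.41,
              316.44, 315.73, 318.73, 315.55, 312.64)
)

# Split: one numeric vector per month (roughly what group_by marks out).
groups <- split(dataSet$detrend, dataSet$month)

# Apply: the mean of each vector (roughly what summarise computes per group).
means <- sapply(groups, mean)

# Combine: back into a data frame with one row per month.
by_month <- data.frame(month = names(means), ave_detrend = unname(means))
print(by_month)
```

Note that split() orders the groups alphabetically (Feb, Jan, March) because month is treated as a factor; convert month to a factor with explicit levels if calendar order matters.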
R has a built-in mean function: mean(x, trim = 0, na.rm = FALSE, ...)
For a single month I would do something like this (using the data frame name and month label from the question):
january <- dataSet[dataSet[, "month"] == "Jan", ]
januaryVector <- january[, "detrend"]
januaryAVG <- mean(januaryVector)

How to get a total number of rows and of rows that have '1' value, grouping by month in r? [duplicate]

This question already has an answer here:
Compute monthly averages from daily data
(1 answer)
Closed 5 years ago.
I have a data set. As an example, take the data for the year 2016.
Say there is one observation per day from Jan 1st to Dec 31st of 2016, and each day's value is either one or zero.
I am trying to calculate the percentage of ones for each month.
I would appreciate any help from the experts!
This should work:
df = data.frame(date = seq(as.Date("2017-01-01"), as.Date("2017-12-31"), by = 1),
                value = sample(c(0, 1), 365, replace = TRUE))
library(dplyr)
df = df %>%
  mutate(month = format(date, "%m")) %>% # or %b for month abbreviation
  group_by(month) %>%
  summarize(value = sum(value) / length(value)) # share of ones per month
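Since the title also asks for the total number of rows and the number of rows equal to 1 per month, here is a hedged base R sketch that returns both counts alongside the percentage. The simulated data mirrors the answer above; the seed and the column names `total`, `ones`, and `pct_ones` are assumptions made for this illustration.

```r
set.seed(1)  # arbitrary seed so the simulated data is reproducible

# One 0/1 observation per day of 2017, as in the answer above.
df <- data.frame(
  date  = seq(as.Date("2017-01-01"), as.Date("2017-12-31"), by = "day"),
  value = sample(c(0, 1), 365, replace = TRUE)
)
df$month <- format(df$date, "%m")

# Per-month counts: all rows, and rows whose value is 1.
total <- tapply(df$value, df$month, length)
ones  <- tapply(df$value == 1, df$month, sum)

by_month <- data.frame(
  month    = names(total),
  total    = as.vector(total),
  ones     = as.vector(ones),
  pct_ones = as.vector(100 * ones / total)
)
print(by_month)
```

Keeping the two counts as separate columns makes it easy to check the percentage by hand for any month.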

averaging by months with daily data [duplicate]

This question already has answers here:
Get monthly means from dataframe of several years of daily temps
(3 answers)
Closed 5 years ago.
I have daily data in a matrix with 6 columns: "Years, months, days, ssts, anoms, missing".
I want to calculate the average SST for each month of each year (for example, for September 1981, the average of the SST values of all days in that month), and I want to do the same for all the years. I have been trying to get my code to work, but have been unable to do so.
You can use the dplyr package in R. For this, we will call your data df.
require(dplyr)
df.mnths <- df %>%
  group_by(Years, months) %>%
  summarise(Mean.SST = mean(SSTs))
df.years <- df %>%
  group_by(Years) %>%
  summarise(Mean.SST = mean(SSTs))
This creates two new data sets: df.mnths holds the mean SST for each month of each year, and df.years holds the mean SST for each year.
With data.table (assuming your data is stored in a data.table named dt) you can do the following:
library(data.table)
dt[, average_sst := mean(SSTs), by = .(years, months)]
This adds an extra column average_sst holding the monthly mean on every row.
Suppose that your data is stored in a data.frame named "data":
years months SSTs
1 1981 1 -0.46939368
2 1981 1 0.03226932
3 1981 1 -1.60266798
4 1981 1 -1.53095676
5 1981 1 1.71177023
6 1981 1 -0.61309846
tapply(data$SSTs, list(data$years, data$months), mean)
tapply(data$SSTs, factor(data$years), mean)
