In my code below, I would like to find the cumsum of variable A for each year. Right now, A is being summed over the entire duration. Any help would be appreciated.
library(dplyr)
library(lubridate)
set.seed(50)
DF <- data.frame(date = seq(as.Date("2001-01-01"), to = as.Date("2003-12-31"), by = "day"),
                 A = runif(1095, 0, 10))
DF1 <- DF %>%
  mutate(Year = year(date), Month = month(date), JDay = yday(date)) %>%
  filter(between(Month, 5, 10)) %>%
  group_by(Year, JDay) %>%
  mutate(Precipitation = cumsum(A))
Just remove JDay from the grouping variables:
DF1 <- DF %>%
  mutate(Year = year(date), Month = month(date), JDay = yday(date)) %>%
  filter(between(Month, 5, 10)) %>%
  group_by(Year) %>%
  mutate(Precipitation = cumsum(A)) %>%
  ungroup()
It seems the issue here is with your grouping clause. Specifically, since there are as many distinct combinations of Year and JDay in your data as there are rows in DF, the subsequent cumsum operation inside mutate simply returns the input column A unchanged. I believe the following should give you what you're after:
library(dplyr)
library(lubridate)
set.seed(50)
DF <- data.frame(date = seq(as.Date("2001-01-01"), to = as.Date("2003-12-31"), by = "day"),
                 A = runif(1095, 0, 10))
DF1 <- DF %>%
  mutate(Year = year(date), Month = month(date), JDay = yday(date)) %>%
  filter(between(Month, 5, 10)) %>%
  arrange(Year, JDay) %>%
  group_by(Year) %>%
  mutate(Precipitation = cumsum(A)) %>%
  ungroup()
# illustrate that Precipitation does indeed give the cumulative value of A for
# each year by printing the first 5 observations for each year in DF1
DF1 %>%
  group_by(Year) %>%
  slice(1:5)
#> # A tibble: 15 x 6
#> # Groups: Year [3]
#> date A Year Month JDay Precipitation
#> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2001-05-01 6.25 2001 5 121 6.25
#> 2 2001-05-02 0.188 2001 5 122 6.43
#> 3 2001-05-03 5.37 2001 5 123 11.8
#> 4 2001-05-04 5.55 2001 5 124 17.4
#> 5 2001-05-05 5.15 2001 5 125 22.5
#> 6 2002-05-01 2.95 2002 5 121 2.95
#> 7 2002-05-02 6.75 2002 5 122 9.71
#> 8 2002-05-03 7.77 2002 5 123 17.5
#> 9 2002-05-04 8.13 2002 5 124 25.6
#> 10 2002-05-05 5.58 2002 5 125 31.2
#> 11 2003-05-01 9.98 2003 5 121 9.98
#> 12 2003-05-02 8.24 2003 5 122 18.2
#> 13 2003-05-03 6.13 2003 5 123 24.4
#> 14 2003-05-04 5.22 2003 5 124 29.6
#> 15 2003-05-05 9.81 2003 5 125 39.4
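As a quick check of that diagnosis, you can verify that every (Year, JDay) combination in DF occurs exactly once, so each group handed to cumsum contains a single row:
DF %>%
  mutate(Year = year(date), JDay = yday(date)) %>%
  count(Year, JDay) %>%
  summarise(max_rows_per_group = max(n))  # returns 1: no group has more than one row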
Here is a data.table solution. If you want the cumsum over each entire year, but only display the interval from month 5 to 10, this is the data.table code for that:
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
library(data.table)
#>
#> Attaching package: 'data.table'
#> The following objects are masked from 'package:lubridate':
#>
#> hour, isoweek, mday, minute, month, quarter, second, wday, week,
#> yday, year
set.seed(50)
DF <- data.frame(date = seq(as.Date("2001-01-01"), to = as.Date("2003-12-31"), by = "day"),
                 A = runif(1095, 0, 10))
data.table(DF)[, `:=`(Year = year(date), Month = month(date), JDay = yday(date))
               ][, Precipitation := cumsum(A), by = Year
               ][between(Month, 5, 10)][]
#> date A Year Month JDay Precipitation
#> 1: 2001-05-01 6.2465000 2001 5 121 568.9538
#> 2: 2001-05-02 0.1877191 2001 5 122 569.1416
#> 3: 2001-05-03 5.3717570 2001 5 123 574.5133
#> 4: 2001-05-04 5.5457454 2001 5 124 580.0591
#> 5: 2001-05-05 5.1508288 2001 5 125 585.2099
#> ---
#> 548: 2003-10-27 0.1979292 2003 10 300 1479.8115
#> 549: 2003-10-28 6.7286553 2003 10 301 1486.5402
#> 550: 2003-10-29 8.7215420 2003 10 302 1495.2617
#> 551: 2003-10-30 8.2572257 2003 10 303 1503.5190
#> 552: 2003-10-31 9.6567923 2003 10 304 1513.1757
If you want the cumsum for the months 5-10 only, you would put the filter before calculating the cumsum:
data.table(DF)[, `:=`(Year = year(date), Month = month(date), JDay = yday(date))
               ][between(Month, 5, 10)
               ][, Precipitation := cumsum(A), by = Year][]
#> date A Year Month JDay Precipitation
#> 1: 2001-05-01 6.2465000 2001 5 121 6.246500
#> 2: 2001-05-02 0.1877191 2001 5 122 6.434219
#> 3: 2001-05-03 5.3717570 2001 5 123 11.805976
#> 4: 2001-05-04 5.5457454 2001 5 124 17.351722
#> 5: 2001-05-05 5.1508288 2001 5 125 22.502550
#> ---
#> 548: 2003-10-27 0.1979292 2003 10 300 916.597973
#> 549: 2003-10-28 6.7286553 2003 10 301 923.326629
#> 550: 2003-10-29 8.7215420 2003 10 302 932.048171
#> 551: 2003-10-30 8.2572257 2003 10 303 940.305396
#> 552: 2003-10-31 9.6567923 2003 10 304 949.962189
I have data resembling the following structure, where the when variable denotes the day of measurement:
## Generate data.
set.seed(1986)
n <- 1000
y <- rnorm(n)
when <- as.POSIXct(strftime(seq(as.POSIXct("2021-11-01 23:00:00 UTC", tryFormats = "%Y-%m-%d"),
as.POSIXct("2022-11-01 23:00:00 UTC", tryFormats = "%Y-%m-%d"),
length.out = n), format = "%Y-%m-%d"))
dta <- data.frame(y, when)
head(dta)
#> y when
#> 1 -0.04625141 2021-11-01
#> 2 0.28000082 2021-11-01
#> 3 0.25317063 2021-11-01
#> 4 -0.96411077 2021-11-02
#> 5 0.49222664 2021-11-02
#> 6 -0.69874551 2021-11-02
I need to compute averages of y over time. For instance, the following computes daily averages:
## Compute daily averages of y.
library(dplyr)
daily_avg <- dta %>%
  group_by(when) %>%
  summarise(daily_mean = mean(y)) %>%
  ungroup()
daily_avg
#> # A tibble: 366 × 2
#> when daily_mean
#> <dttm> <dbl>
#> 1 2021-11-01 00:00:00 0.162
#> 2 2021-11-02 00:00:00 -0.390
#> 3 2021-11-03 00:00:00 -0.485
#> 4 2021-11-04 00:00:00 -0.152
#> 5 2021-11-05 00:00:00 0.425
#> 6 2021-11-06 00:00:00 0.726
#> 7 2021-11-07 00:00:00 0.855
#> 8 2021-11-08 00:00:00 0.0608
#> 9 2021-11-09 00:00:00 -0.995
#> 10 2021-11-10 00:00:00 0.395
#> # … with 356 more rows
I am having a hard time computing weekly averages. Here is what I have tried so far:
## Fail - compute weekly averages of y.
library(lubridate)
dta$week <- week(dta$when) # This is wrong.
dta[165:171, ]
#> y when week
#> 165 0.9758333 2021-12-30 52
#> 166 -0.8630091 2021-12-31 53
#> 167 0.3054031 2021-12-31 53
#> 168 1.2814421 2022-01-01 1
#> 169 0.1025440 2022-01-01 1
#> 170 1.3665411 2022-01-01 1
#> 171 -0.5373058 2022-01-02 1
Using the week function from the lubridate package ignores the fact that my data span across years. So, if I were to use code similar to what I used for the daily averages, I would aggregate observations belonging to different years (but to the same week number). How can I solve this?
You can use %V (from ?strptime) for weeks, combining it with the year.
dta %>%
  group_by(week = format(when, format = "%Y-%V")) %>%
  summarize(weekly_mean = mean(y)) %>%
  ungroup()
# # A tibble: 54 x 2
# week weekly_mean
# <chr> <dbl>
# 1 2021-44 0.179
# 2 2021-45 0.0477
# 3 2021-46 0.0340
# 4 2021-47 0.356
# 5 2021-48 0.0544
# 6 2021-49 -0.0948
# 7 2021-50 -0.0419
# 8 2021-51 0.209
# 9 2021-52 0.251
# 10 2022-01 -0.197
# # ... with 44 more rows
There are different variants of "week", depending on your preference.
%V
Week of the year as decimal number (01–53) as defined in ISO 8601.
If the week (starting on Monday) containing 1 January has four or more
days in the new year, then it is considered week 1. Otherwise, it is
the last week of the previous year, and the next week is week 1.
(Accepted but ignored on input.)
%W
Week of the year as decimal number (00–53) using Monday as the first
day of week (and typically with the first Monday of the year as day 1
of week 1). The UK convention.
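To see the two conventions side by side, here is a small sketch around a year boundary. Note that %V logically pairs with the ISO week-based year %G rather than %Y (and %G/%V support can depend on the platform's strftime):
d <- as.Date(c("2021-12-31", "2022-01-01", "2022-01-03"))
format(d, "%G-%V")  # ISO 8601: "2021-52" "2021-52" "2022-01"
format(d, "%Y-%W")  # UK convention: "2021-52" "2022-00" "2022-01"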
You can extract year and week from the dates and group by both:
dta %>%
  mutate(year = year(when),
         week = week(when)) %>%
  group_by(year, week) %>%
  summarise(y_mean = mean(y)) %>%
  ungroup()
# # A tibble: 54 x 3
# # Groups: year, week [54]
# year week y_mean
# <dbl> <dbl> <dbl>
# 1 2021 44 -0.222
# 2 2021 45 0.234
# 3 2021 46 0.0953
# 4 2021 47 0.206
# 5 2021 48 0.192
# 6 2021 49 -0.0831
# 7 2021 50 0.0282
# 8 2021 51 0.196
# 9 2021 52 0.132
# 10 2021 53 -0.279
# # ... with 44 more rows
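If you prefer to stay within lubridate, its ISO helpers also pair correctly at year boundaries, since isoyear() is defined to match isoweek(). A sketch:
library(lubridate)
library(dplyr)
dta %>%
  group_by(year = isoyear(when), week = isoweek(when)) %>%
  summarise(y_mean = mean(y), .groups = "drop")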
I have a data frame with four columns:
year <- c(2000, 2000, 2000, 2001, 2001, 2001, 2002, 2002, 2002)
k <- c(12.5, 11.5, 10.5, -8.5, -9.5, -10.5, 13.9, 14.9, 15.9)
pop <- c(143, 147, 154, 445, 429, 430, 178, 181, 211)
pop_obs <- c(150, 150, 150, 440, 440, 440, 185, 185, 185)
df <- data_frame(year, k, pop, pop_obs)
df
year k pop pop_obs
<dbl> <dbl> <dbl> <dbl>
1 2000 12.5 143 150
2 2000 11.5 147 150
3 2000 10.5 154 150
4 2001 -8.5 445 440
5 2001 -9.5 429 440
6 2001 -10.5 430 440
7 2002 13.9 178 185
8 2002 14.9 181 185
9 2002 15.9 211 185
What I want is, for each year, the value of k whose pop has the minimum absolute difference from pop_obs. Finally, I want to keep the result as a data frame with one row per year, giving year and k.
My expected output would look like this:
year k
<dbl> <dbl>
1 2000 11.5
2 2001 -8.5
3 2002 14.9
You could try it with dplyr:
df <- data.frame(year, k, pop, pop_obs)
library(dplyr)
df %>%
  mutate(diff = abs(pop_obs - pop)) %>%
  group_by(year) %>%
  filter(diff == min(diff)) %>%
  select(year, k)
#> # A tibble: 3 x 2
#> # Groups: year [3]
#> year k
#> <dbl> <dbl>
#> 1 2000 11.5
#> 2 2001 -8.5
#> 3 2002 14.9
Created on 2021-12-11 by the reprex package (v2.0.1)
Try the tidyverse way; slice_min() keeps the row with the smallest absolute difference within each year:
library(tidyverse)
data_you_want <- df %>%
  mutate(dif = abs(pop - pop_obs)) %>%
  group_by(year) %>%
  slice_min(dif, n = 1) %>%
  ungroup() %>%
  select(year, k)
Using base R: ave() computes the absolute difference within each year and flags the rows equal to the group minimum, and subset() keeps those rows.
subset(df, as.logical(ave(abs(pop_obs - pop), year,
                          FUN = function(x) x == min(x))),
       select = c('year', 'k'))
# A tibble: 3 × 2
year k
<dbl> <dbl>
1 2000 11.5
2 2001 -8.5
3 2002 14.9
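If the ave() call looks opaque, running it on its own shows the index it builds; the 1s mark the minimum-difference row within each year, which as.logical() then turns into TRUE:
ave(abs(pop_obs - pop), year, FUN = function(x) x == min(x))
#> [1] 0 1 0 1 0 0 0 1 0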
Say we have a dataframe looking like this one below:
month issue amount
Jan withdrawal 250
Jan delay 120
Jan other 65
Feb withdrawal 189
Feb delay 122
Feb other 89
My goal is to tweak this in order to get a dataframe giving me the percentages of each value in issue related to each month. Plainly speaking, my desired output should look like follows:
month issue rate
Jan withdrawal 57.47
Jan delay 27.59
Jan other 14.94
Feb withdrawal 47.25
Feb delay 30.50
Feb other 22.25
I've tried helping myself with dplyr, but my attempts have all been unsuccessful so far.
library(tidyverse)
df <- read.table(text = "month issue amount
Jan withdrawal 250
Jan delay 120
Jan other 65
Feb withdrawal 189
Feb delay 122
Feb other 89", header = T)
df
#> month issue amount
#> 1 Jan withdrawal 250
#> 2 Jan delay 120
#> 3 Jan other 65
#> 4 Feb withdrawal 189
#> 5 Feb delay 122
#> 6 Feb other 89
df %>%
  group_by(month) %>%
  mutate(rate = amount / sum(amount, na.rm = TRUE) * 100)
#> # A tibble: 6 x 4
#> # Groups: month [2]
#> month issue amount rate
#> <chr> <chr> <int> <dbl>
#> 1 Jan withdrawal 250 57.5
#> 2 Jan delay 120 27.6
#> 3 Jan other 65 14.9
#> 4 Feb withdrawal 189 47.2
#> 5 Feb delay 122 30.5
#> 6 Feb other 89 22.2
df %>%
  group_by(month) %>%
  mutate(rate = prop.table(amount) * 100)
#> # A tibble: 6 x 4
#> # Groups: month [2]
#> month issue amount rate
#> <chr> <chr> <int> <dbl>
#> 1 Jan withdrawal 250 57.5
#> 2 Jan delay 120 27.6
#> 3 Jan other 65 14.9
#> 4 Feb withdrawal 189 47.2
#> 5 Feb delay 122 30.5
#> 6 Feb other 89 22.2
Created on 2021-01-26 by the reprex package (v0.3.0)
Using data.table:
library(data.table)
setDT(df)
df[, rate := prop.table(amount) * 100, by = list(month)]
df
#> month issue amount rate
#> 1: Jan withdrawal 250 57.47126
#> 2: Jan delay 120 27.58621
#> 3: Jan other 65 14.94253
#> 4: Feb withdrawal 189 47.25000
#> 5: Feb delay 122 30.50000
#> 6: Feb other 89 22.25000
Created on 2021-01-26 by the reprex package (v0.3.0)
Try the answer here; it might be what you're looking for. The code is pasted below for your convenience.
library(dplyr)
group_by(df, group) %>% mutate(percent = value/sum(value))
In your case it would probably be something like:
library(dplyr)
group_by(df, month) %>% mutate(rate= amount/sum(amount))
Or alternatively:
group_by(df, month) %>% transmute(issue, rate= amount/sum(amount))
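To match the question's two-decimal percentages exactly, just add round() to the same idea (a small sketch):
library(dplyr)
df %>%
  group_by(month) %>%
  mutate(rate = round(amount / sum(amount) * 100, 2)) %>%
  select(month, issue, rate)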
I'm trying to calculate the year-wise standard error for the variable acrePrice. I'm running the function stderr (I also tried sd(acrePrice)/count(n)). Both of these return an error.
Here's the relevant code:
library(alr4)
library(tidyverse)
MinnLand %>% group_by(year) %>% summarize(sd(acrePrice)/count(n))
MinnLand %>% group_by(year) %>% summarize(stderr(acrePrice))
Why is there a problem? The mean and SDs are easily calculated.
The issue with the first approach is count, which requires a data.frame; inside summarize, the row count is n(). (Note that this reproduces the sd/n formula from the question; the conventional standard error divides by sqrt(n) instead, as shown at the end of this answer.)
library(dplyr)
MinnLand %>%
  group_by(year) %>%
  summarize(SE = sd(acrePrice)/n(), .groups = 'drop')
-output
# A tibble: 10 x 2
# year SE
# <dbl> <dbl>
# 1 2002 2.25
# 2 2003 0.840
# 3 2004 0.742
# 4 2005 0.862
# 5 2006 0.849
# 6 2007 0.765
# 7 2008 0.708
# 8 2009 1.23
# 9 2010 0.986
#10 2011 1.95
According to ?stderr
stdin(), stdout() and stderr() are standard connections corresponding to input, output and error on the console respectively (and not necessarily to file streams).
We can use std.error from plotrix:
library(plotrix)
MinnLand %>%
  group_by(year) %>%
  summarize(SE = std.error(acrePrice))
-output
# A tibble: 10 x 2
# year SE
# <dbl> <dbl>
# 1 2002 53.4
# 2 2003 38.6
# 3 2004 37.0
# 4 2005 41.5
# 5 2006 39.7
# 6 2007 36.3
# 7 2008 34.9
# 8 2009 47.1
# 9 2010 42.1
#10 2011 63.6
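Since std.error() is just sd(x)/sqrt(n) over the non-missing values, the same result can be reproduced without the extra package (a sketch, assuming acrePrice has no NAs):
library(dplyr)
MinnLand %>%
  group_by(year) %>%
  summarize(SE = sd(acrePrice)/sqrt(n()), .groups = 'drop')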
I have a dataframe df
str(df)
'data.frame': 396 obs. of 23 variables:
$ Year : chr "1986" "1986" "1986" "1986" ...
$ Month : chr "Jan" "Feb" "Mar" "Apr" ...
$ Season : Factor w/ 4 levels "Monsoon","PostMonsoon",..: 4 4 3 3 3 1 1 1 2 2 ...
$ stn1 : num 2.3 42.3 91.1 267.4 482.1 ...
$ stn2 : num 0 9 23.8 61.7 68.3 ...
$ stn3 : num 0 10 34.6 52.5 122 …
I want to calculate the 3-month rolling mean, and I tried the following code:
library(zoo)
Roll_mean <- function(x) rollmean(x, 3)  # function to calculate the 3-month rolling mean
monthroll_mean <- df[-2] %>%
  group_by(Year, Season) %>%
  summarise_all(list(Roll_mean))
However, I do not get what I expect.
For instance, I want my final dataframe to look like this:
head(monthroll_mean)
Year Months stn1 stn2 stn3
<chr> <fct> <dbl> <dbl> <dbl>
1 1986 DJF 32.0 10.0 10
2 1986 JFM 1157. 141. 282.
3 1986 FMA 280. 51.3 69.7
4 1986 MAM 352. 78.5 121.
5 1986 AMJ 27.3 4.47 20.5
6 1986 MJJ 1005. 139. 235
How can I create a "Months" column that labels each 3-month rolling mean?
Your help would be appreciated.
Try something like this:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df <- expand.grid(year = 1991:1995,
                  month = month.abb) %>%
  arrange(year, month) %>%
  mutate(var1 = runif(nrow(.), 10, 20), var2 = runif(nrow(.), 10, 20))
head(df,10) #dummy data
#> year month var1 var2
#> 1 1991 Jan 13.19097 12.44519
#> 2 1991 Feb 17.02439 10.55053
#> 3 1991 Mar 11.21088 17.08536
#> 4 1991 Apr 19.73014 17.60298
#> 5 1991 May 12.72299 12.95819
#> 6 1991 Jun 17.19959 19.90314
#> 7 1991 Jul 11.47601 17.77892
#> 8 1991 Aug 10.43157 14.51260
#> 9 1991 Sep 13.66881 14.34805
#> 10 1991 Oct 13.50884 11.62024
library(zoo)
#>
#> Attaching package: 'zoo'
#> The following objects are masked from 'package:base':
#>
#> as.Date, as.Date.numeric
rolldf <- df %>%
  mutate(months = rollapply(month, 3, function(x) {
    paste(substr(x, 1, 1), collapse = '')
  }, align = 'right', fill = NA)) %>%
  mutate_at(vars(var1, var2), ~ rollmean(., k = 3, align = 'right', fill = NA))
head(rolldf)
#> year month var1 var2 months
#> 1 1991 Jan NA NA <NA>
#> 2 1991 Feb NA NA <NA>
#> 3 1991 Mar 13.80875 13.36036 JFM
#> 4 1991 Apr 15.98847 15.07962 FMA
#> 5 1991 May 14.55467 15.88217 MAM
#> 6 1991 Jun 16.55091 16.82144 AMJ
Created on 2020-02-13 by the reprex package (v0.3.0)
You can try:
library(dplyr)
df %>%
  group_by(Year, Season) %>%
  # rollapplyr needs a function; mean gives the 3-row right-aligned rolling mean
  mutate_at(vars(starts_with('stn')), ~ zoo::rollapplyr(., 3, mean, fill = NA))
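To also build the "Months" label from the expected output, the rolling-label trick from the first answer carries over. Here is a sketch that rolls within each Year instead of Year and Season (note the first two months of each year get NA rather than wrapping into the previous December):
library(dplyr)
library(zoo)
df %>%
  group_by(Year) %>%
  mutate(Months = rollapplyr(Month, 3,
                             function(m) paste(substr(m, 1, 1), collapse = ""),
                             fill = NA),
         across(starts_with("stn"), ~ rollmeanr(.x, k = 3, fill = NA))) %>%
  ungroup()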