I am new to coding. I have a data set of daily stream flow averages over 20 years. Following is an example:
DATE FLOW
1 10/1/2001 88.2
2 10/2/2001 77.6
3 10/3/2001 68.4
4 10/4/2001 61.5
5 10/5/2001 55.3
6 10/6/2001 52.5
7 10/7/2001 49.7
8 10/8/2001 46.7
9 10/9/2001 43.3
10 10/10/2001 41.3
11 10/11/2001 39.3
12 10/12/2001 37.7
13 10/13/2001 35.8
14 10/14/2001 34.1
15 10/15/2001 39.8
I need to create a loop that averages the previous 6 days together with the current day (a rolling weekly average) and stores the result in an array for the designated water year. I have already written a function that assigns dates to their water years and used aggregate to take the mean of the daily flows for each water year.
# Separating dates into specific water years
wtr_yr <- function(dates, start_month = 9) {
  # Convert dates into POSIXlt
  POSIDATE <- as.POSIXlt(dates)
  # Year offset: months from start_month onward count toward the next water year
  offset <- ifelse(POSIDATE$mon >= start_month - 1, 1, 0)
  # Water year
  POSIDATE$year + 1900 + offset
}
# Aggregating by water year to take the mean
adj.year <- wtr_yr(NEW_DATE)
mean.FLOW <- aggregate(data_set$FLOW, list(adj.year), mean)
It seems that it can be done much more easily.
But first I need to prepare a bit more data.
library(tidyverse)
library(lubridate)
df = tibble(
DATE = seq(mdy("1/1/2010"), mdy("12/31/2022"), 1),
FLOW = rnorm(length(DATE), 40, 10)
)
output
# A tibble: 4,748 x 2
DATE FLOW
<date> <dbl>
1 2010-01-01 34.4
2 2010-01-02 37.7
3 2010-01-03 55.6
4 2010-01-04 40.7
5 2010-01-05 41.3
6 2010-01-06 57.2
7 2010-01-07 44.6
8 2010-01-08 27.3
9 2010-01-09 33.1
10 2010-01-10 35.5
# ... with 4,738 more rows
Now let's do the aggregation by year and week number
df %>%
group_by(year(DATE), week(DATE)) %>%
summarise(mean = mean(FLOW))
output
# A tibble: 689 x 3
# Groups: year(DATE) [13]
`year(DATE)` `week(DATE)` mean
<dbl> <dbl> <dbl>
1 2010 1 44.5
2 2010 2 39.6
3 2010 3 38.5
4 2010 4 35.3
5 2010 5 44.1
6 2010 6 39.4
7 2010 7 41.3
8 2010 8 43.9
9 2010 9 38.5
10 2010 10 42.4
# ... with 679 more rows
Note that the week function counts weeks in seven-day blocks starting on January 1st, so week 1 always begins on that date. If you want to number the weeks according to the ISO 8601 standard, use the isoweek function. Alternatively, you can use epiweek, which follows the US CDC convention.
df %>%
group_by(year(DATE), isoweek(DATE)) %>%
summarise(mean = mean(FLOW))
output
# A tibble: 681 x 3
# Groups: year(DATE) [13]
`year(DATE)` `isoweek(DATE)` mean
<dbl> <dbl> <dbl>
1 2010 1 40.0
2 2010 2 45.5
3 2010 3 33.2
4 2010 4 38.9
5 2010 5 45.0
6 2010 6 40.7
7 2010 7 38.5
8 2010 8 42.5
9 2010 9 37.1
10 2010 10 42.4
# ... with 671 more rows
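One caveat when grouping by ISO week: the first days of January can belong to the last ISO week of the previous year, so pairing isoweek with the calendar year can put them in an unexpected group. A minimal sketch, assuming lubridate's isoyear function is the appropriate companion to isoweek (the three dates are picked only for illustration):

```r
library(lubridate)

# Three illustrative dates around the 2010 year boundaries
d <- as.Date(c("2010-01-01", "2010-01-04", "2010-12-31"))

iso_labels <- data.frame(
  DATE     = d,
  cal_year = year(d),     # calendar year
  iso_year = isoyear(d),  # year that the ISO 8601 week belongs to
  iso_week = isoweek(d)   # ISO 8601 week number
)

# 2010-01-01 has calendar year 2010 but belongs to ISO week 53 of ISO year 2009
print(iso_labels)
```

Grouping by isoyear(DATE) and isoweek(DATE) instead of year(DATE) and isoweek(DATE) would keep such days with the rest of their ISO week.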
If you want to better understand how these functions differ, compare their results side by side with the code below
df %>%
mutate(
w1 = week(DATE),
w2 = isoweek(DATE),
w3 = epiweek(DATE)
)
output
# A tibble: 4,748 x 5
DATE FLOW w1 w2 w3
<date> <dbl> <dbl> <dbl> <dbl>
1 2010-01-01 34.4 1 53 52
2 2010-01-02 37.7 1 53 52
3 2010-01-03 55.6 1 53 1
4 2010-01-04 40.7 1 1 1
5 2010-01-05 41.3 1 1 1
6 2010-01-06 57.2 1 1 1
7 2010-01-07 44.6 1 1 1
8 2010-01-08 27.3 2 1 1
9 2010-01-09 33.1 2 1 1
10 2010-01-10 35.5 2 1 2
# ... with 4,738 more rows
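The grouping above gives one mean per calendar week. If what is needed is the rolling mean described in the question (the current day plus the previous 6 days), one possible approach is a trailing moving-average filter. This is a minimal sketch using base R's stats::filter on the 15 sample values from the question; the names flow and roll7 are chosen here for illustration:

```r
# Sample FLOW values copied from the question (1 to 15 October 2001)
flow <- c(88.2, 77.6, 68.4, 61.5, 55.3, 52.5, 49.7, 46.7,
          43.3, 41.3, 39.3, 37.7, 35.8, 34.1, 39.8)

# Trailing 7-day mean: equal weights of 1/7 over the current day and the
# previous 6 days. sides = 1 makes the window look backwards only.
roll7 <- as.numeric(stats::filter(flow, rep(1 / 7, 7), sides = 1))

# The first 6 entries are NA because a full 7-day window is not yet available
print(round(roll7, 2))
```

The resulting vector has the same length as the input, so it can be added as a column next to DATE and then split or aggregated by water year like any other column.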
I have the dataset below and want to fit a linear regression for each combination of country and Area, then cbind the predicted values to the dataset as three additional columns (the fit plus the lower and upper limits).
I have done it for one country and one Area, but want to do it for every combination and put the predicted, upper and lower limit values back in the data set with cbind:
data <- data.frame(country = c("US","US","US","US","US","US","US","US","US","US","UK","UK","UK","UK","UK"),
Area = c("G","G","G","G","G","I","I","I","I","I","A","A","A","A","A"),
week = c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5),amount = c(12,23,34,32,12,12,34,45,65,45,45,34,23,43,43))
data_1 <- data[(data$country=="US" & data$Area=="G"),]
model <- lm(amount ~ week, data = data_1)
pre <- predict(model,newdata = data_1,interval = "prediction",level = 0.95)
pre
How can I loop this over the other combinations of country and Area?
...and a Base R solution:
data <- data.frame(country = c("US","US","US","US","US","US","US","US","US","US","UK","UK","UK","UK","UK"),
Area = c("G","G","G","G","G","I","I","I","I","I","A","A","A","A","A"),
week = c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5),amount = c(12,23,34,32,12,12,34,45,65,45,45,34,23,43,43))
splitVar <- paste0(data$country,"-",data$Area)
dfList <- split(data,splitVar)
result <- do.call(rbind,lapply(dfList,function(x){
model <- lm(amount ~ week, data = x)
cbind(x,predict(model,newdata = x,interval = "prediction",level = 0.95))
}))
result
...the results:
country Area week amount fit lwr upr
UK-A.11 UK A 1 45 36.6 -6.0463638 79.24636
UK-A.12 UK A 2 34 37.1 -1.3409128 75.54091
UK-A.13 UK A 3 23 37.6 0.6671656 74.53283
UK-A.14 UK A 4 43 38.1 -0.3409128 76.54091
UK-A.15 UK A 5 43 38.6 -4.0463638 81.24636
US-G.1 US G 1 12 20.8 -27.6791493 69.27915
US-G.2 US G 2 23 21.7 -21.9985147 65.39851
US-G.3 US G 3 34 22.6 -19.3841749 64.58417
US-G.4 US G 4 32 23.5 -20.1985147 67.19851
US-G.5 US G 5 12 24.4 -24.0791493 72.87915
US-I.6 US I 1 12 20.8 -33.8985900 75.49859
US-I.7 US I 2 34 30.5 -18.8046427 79.80464
US-I.8 US I 3 45 40.2 -7.1703685 87.57037
US-I.9 US I 4 65 49.9 0.5953573 99.20464
US-I.10 US I 5 45 59.6 4.9014100 114.29859
We can also use the augment function from the broom package to get the desired information:
library(dplyr)
library(tidyr)
library(purrr)
library(broom)
data %>%
group_by(country, Area) %>%
nest() %>%
mutate(models = map(data, ~ lm(amount ~ week, data = .)),
aug = map(models, ~ augment(.x, interval = "prediction"))) %>%
unnest(aug) %>%
select(country, Area, amount, week, .fitted, .lower, .upper)
# A tibble: 15 x 7
# Groups: country, Area [3]
country Area amount week .fitted .lower .upper
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 US G 12 1 20.8 -27.7 69.3
2 US G 23 2 21.7 -22.0 65.4
3 US G 34 3 22.6 -19.4 64.6
4 US G 32 4 23.5 -20.2 67.2
5 US G 12 5 24.4 -24.1 72.9
6 US I 12 1 20.8 -33.9 75.5
7 US I 34 2 30.5 -18.8 79.8
8 US I 45 3 40.2 -7.17 87.6
9 US I 65 4 49.9 0.595 99.2
10 US I 45 5 59.6 4.90 114.
11 UK A 45 1 36.6 -6.05 79.2
12 UK A 34 2 37.1 -1.34 75.5
13 UK A 23 3 37.6 0.667 74.5
14 UK A 43 4 38.1 -0.341 76.5
15 UK A 43 5 38.6 -4.05 81.2
Here is a tidyverse way to do this for every combination of country and Area.
library(tidyverse)
data %>%
group_by(country, Area) %>%
nest() %>%
mutate(model = map(data, ~ lm(amount ~ week, data = .x)),
result = map2(model, data, ~data.frame(predict(.x, newdata = .y,
interval = "prediction",level = 0.95)))) %>%
ungroup %>%
select(-model) %>%
unnest(c(data, result))
# country Area week amount fit lwr upr
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 US G 1 12 20.8 -27.7 69.3
# 2 US G 2 23 21.7 -22.0 65.4
# 3 US G 3 34 22.6 -19.4 64.6
# 4 US G 4 32 23.5 -20.2 67.2
# 5 US G 5 12 24.4 -24.1 72.9
# 6 US I 1 12 20.8 -33.9 75.5
# 7 US I 2 34 30.5 -18.8 79.8
# 8 US I 3 45 40.2 -7.17 87.6
# 9 US I 4 65 49.9 0.595 99.2
#10 US I 5 45 59.6 4.90 114.
#11 UK A 1 45 36.6 -6.05 79.2
#12 UK A 2 34 37.1 -1.34 75.5
#13 UK A 3 23 37.6 0.667 74.5
#14 UK A 4 43 38.1 -0.341 76.5
#15 UK A 5 43 38.6 -4.05 81.2
And one more:
library(tidyverse)
data %>%
mutate(CountryArea=paste0(country,Area) %>% factor %>% fct_inorder) %>%
split(.$CountryArea) %>%
map(~lm(amount~week, data=.)) %>%
map(predict, interval = "prediction",level = 0.95) %>%
reduce(rbind) %>%
cbind(data, .)
country Area week amount fit lwr upr
1 US G 1 12 20.8 -27.6791493 69.27915
2 US G 2 23 21.7 -21.9985147 65.39851
3 US G 3 34 22.6 -19.3841749 64.58417
4 US G 4 32 23.5 -20.1985147 67.19851
5 US G 5 12 24.4 -24.0791493 72.87915
6 US I 1 12 20.8 -33.8985900 75.49859
7 US I 2 34 30.5 -18.8046427 79.80464
8 US I 3 45 40.2 -7.1703685 87.57037
9 US I 4 65 49.9 0.5953573 99.20464
10 US I 5 45 59.6 4.9014100 114.29859
11 UK A 1 45 36.6 -6.0463638 79.24636
12 UK A 2 34 37.1 -1.3409128 75.54091
13 UK A 3 23 37.6 0.6671656 74.53283
14 UK A 4 43 38.1 -0.3409128 76.54091
15 UK A 5 43 38.6 -4.0463638 81.24636
I am trying to find the mean of A and Z for each row and save it as a separate column, but it seems the code only averages the first row and fills the rest of the rows with that value. Any suggestions on how to fix this?
library(tidyverse)
library(lubridate)
set.seed(123)
DF <- data.frame(Date = seq(as.Date("2001-01-01"), to = as.Date("2003-12-31"), by = "day"),
A = runif(1095, 1,60),
Z = runif(1095, 5,100)) %>%
mutate(MeanofAandZ= mean(A:Z))
Are you looking for this:
DF %>% rowwise() %>% mutate(MeanofAandZ = mean(c_across(A:Z)))
# A tibble: 1,095 x 4
# Rowwise:
Date A Z MeanofAandZ
<date> <dbl> <dbl> <dbl>
1 2001-01-01 26.5 7.68 17.1
2 2001-01-02 54.9 33.1 44.0
3 2001-01-03 37.1 82.0 59.5
4 2001-01-04 6.91 18.0 12.4
5 2001-01-05 53.0 8.76 30.9
6 2001-01-06 26.1 7.63 16.9
7 2001-01-07 59.3 30.8 45.0
8 2001-01-08 39.9 14.6 27.3
9 2001-01-09 59.2 93.6 76.4
10 2001-01-10 30.7 89.1 59.9
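As background on why the original attempt fills every row with one value: inside mutate, A:Z is the ordinary colon operator applied to the two columns, not a column selection. The colon operator uses only the first element of each vector (R warns about this), so mean() receives a single sequence and returns a single number that is then recycled. A small sketch with made-up toy vectors:

```r
# Toy vectors standing in for the two columns (values invented for illustration)
A <- c(2, 10, 4)
Z <- c(6, 20, 8)

# `:` only looks at A[1] and Z[1], so A:Z is 2:6 and the mean is one number.
# suppressWarnings() hides the "only the first used" warning R emits here.
single_value <- suppressWarnings(mean(A:Z))

# Row-wise means instead: one value per row
row_means <- rowMeans(cbind(A, Z))

print(single_value)
print(row_means)
```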
You can also do it with base R's rowMeans.
Full Base R:
DF$MeanofAandZ <- rowMeans(DF[c("A", "Z")])
head(DF)
#> Date A Z MeanofAandZ
#> 1 2001-01-01 17.967074 76.92436 47.44572
#> 2 2001-01-02 47.510003 99.28325 73.39663
#> 3 2001-01-03 25.129638 64.33253 44.73109
#> 4 2001-01-04 53.098027 32.42556 42.76179
#> 5 2001-01-05 56.487570 23.99162 40.23959
#> 6 2001-01-06 3.687833 81.08720 42.38751
or inside a mutate:
library(dplyr)
DF <- DF %>% mutate(MeanofAandZ = rowMeans(cbind(A,Z)))
head(DF)
#> Date A Z MeanofAandZ
#> 1 2001-01-01 17.967074 76.92436 47.44572
#> 2 2001-01-02 47.510003 99.28325 73.39663
#> 3 2001-01-03 25.129638 64.33253 44.73109
#> 4 2001-01-04 53.098027 32.42556 42.76179
#> 5 2001-01-05 56.487570 23.99162 40.23959
#> 6 2001-01-06 3.687833 81.08720 42.38751
We can also do
DF$MeanofAandZ <- Reduce(`+`, DF[c("A", "Z")])/2
Or using apply
DF$MeanofAandZ <- apply(DF[c("A", "Z")], 1, mean)
I have the dataframe (code) and I want to sort it according to combName in numerical order.
> code
# A tibble: 1,108 x 2
combName sumLength
<chr> <dbl>
1 20-1 8.05
2 20-10 14.7
3 20-100 21.2
4 20-101 17.6
5 20-102 25.4
6 20-103 46.3
7 20-104 68.7
8 20-105 24.3
9 20-106 46.3
10 20-107 14.0
# ... with 1,098 more rows
Afterwards the left column should look like:
> code
# A tibble: 1,108 x 2
combName sumLength
<chr> <dbl>
1 20-1 8.05
2 20-2 ...
3 20-3 ...
4 20-4 ...
5 20-5 ...
...
10 20-10 14.7
# ... with 1,098 more rows
I do not know how to get it into this order.
Does this work:
library(dplyr)
library(tidyr)
df
# A tibble: 10 x 2
combName sumLength
<chr> <dbl>
1 20-102 25.4
2 20-100 21.2
3 20-101 17.6
4 20-105 24.3
5 20-10 14.7
6 20-103 46.3
7 20-104 68.7
8 20-1 8.05
9 20-106 46.3
10 20-107 14
df %>% separate(combName, into = c('1','2'), sep = '-', remove = FALSE) %>%
  type.convert(as.is = TRUE) %>% arrange(`1`,`2`) %>% select(-c(`1`,`2`))
# A tibble: 10 x 2
combName sumLength
<chr> <dbl>
1 20-1 8.05
2 20-10 14.7
3 20-100 21.2
4 20-101 17.6
5 20-102 25.4
6 20-103 46.3
7 20-104 68.7
8 20-105 24.3
9 20-106 46.3
10 20-107 14
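The ten-row sample above happens to come out in the same order under character and numeric sorting, so it does not show the difference. A minimal base R sketch of the same idea (split on the hyphen, convert both parts to numbers, order by them), using a small made-up data frame that includes values such as 20-2 where the two orderings disagree; the sumLength values here are invented for illustration:

```r
# Hypothetical sample; only the combName format follows the question
code <- data.frame(
  combName  = c("20-10", "20-1", "20-100", "20-2", "3-5"),
  sumLength = c(14.7, 8.05, 21.2, 1.0, 2.0),
  stringsAsFactors = FALSE
)

# Split "20-10" into "20" and "10"; one row per combName
parts <- do.call(rbind, strsplit(code$combName, "-", fixed = TRUE))

# Order by the numeric prefix first, then by the numeric suffix
sorted <- code[order(as.numeric(parts[, 1]), as.numeric(parts[, 2])), ]

print(sorted)
```

A plain sort(code$combName) would instead place 20-10 and 20-100 before 20-2, because strings are compared character by character.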
I want to create var3 using a lag (dplyr package), but the lag should be consistent with the year and the ID. That is, the lagged value should belong to the same ID. The dataset is an unbalanced panel.
YEAR ID VARS
2010 1 -
2011 1 -
2012 1 -
2010 2 -
2011 2 -
2012 2 -
2010 3 -
...
My issue is similar to the following question/post, but grouping by two categories:
dplyr: lead() and lag() wrong when used with group_by()
I tried to extend the solution, unsuccessfully (I get NAs).
Attempt #1:
data %>%
group_by(YEAR,ID) %>%
summarise(var1 = ...
var2 = ...
var3 = var1 - dplyr::lag(var2))
)
Attempt #2:
data %>%
group_by(YEAR,ID) %>%
summarise(var1 = ...
var2 = ...
gr = sprintf(YEAR,ID)
var3 = var1 - dplyr::lag(var2, order_by = gr))
)
Minimum example:
MyData <-
data.frame(YEAR = rep(seq(2010,2014),5),
ID = rep(1:5, each=5),
var1 = rnorm(n=25,mean=10,sd=3),
var2 = rnorm(n=25,mean=1,sd=1)
)
MyData %>%
group_by(YEAR,ID) %>%
summarise(var3 = var1 - dplyr::lag(var2)
)
Thanks in advance.
Do you mean group_by(ID) and effectively "order by YEAR"?
MyData %>%
group_by(ID) %>%
mutate(var3 = var1 - dplyr::lag(var2)) %>%
print(n=99)
# # A tibble: 25 x 5
# # Groups: ID [5]
# YEAR ID var1 var2 var3
# <int> <int> <dbl> <dbl> <dbl>
# 1 2010 1 11.1 1.16 NA
# 2 2011 1 13.5 -0.550 12.4
# 3 2012 1 10.2 2.11 10.7
# 4 2013 1 8.57 1.43 6.46
# 5 2014 1 12.6 1.89 11.2
# 6 2010 2 8.87 1.87 NA
# 7 2011 2 5.30 1.70 3.43
# 8 2012 2 6.81 0.956 5.11
# 9 2013 2 13.3 -0.0296 12.4
# 10 2014 2 9.98 -1.27 10.0
# 11 2010 3 8.62 0.258 NA
# 12 2011 3 12.4 2.00 12.2
# 13 2012 3 16.1 2.12 14.1
# 14 2013 3 8.48 2.83 6.37
# 15 2014 3 10.6 0.190 7.80
# 16 2010 4 12.3 0.887 NA
# 17 2011 4 10.9 1.07 10.0
# 18 2012 4 7.99 1.09 6.92
# 19 2013 4 10.1 1.95 9.03
# 20 2014 4 11.1 1.82 9.17
# 21 2010 5 15.1 1.67 NA
# 22 2011 5 10.4 0.492 8.76
# 23 2012 5 10.0 1.66 9.51
# 24 2013 5 10.6 0.567 8.91
# 25 2014 5 5.32 -0.881 4.76
(I have set aside your summarise and used a mutate for now.)
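If the rows are not already sorted by YEAR within each ID, the lag can be told which column defines the order. A minimal sketch, assuming dplyr's order_by argument to lag is what is wanted here; the small panel data frame is made up and deliberately unsorted:

```r
library(dplyr)

# Made-up unbalanced panel: ID 1 has three years (out of order), ID 2 has two
panel <- data.frame(
  YEAR = c(2011, 2010, 2012, 2010, 2011),
  ID   = c(1, 1, 1, 2, 2),
  var1 = c(10, 20, 30, 40, 50),
  var2 = c(1, 2, 3, 4, 5)
)

result <- panel %>%
  group_by(ID) %>%
  # Within each ID, take var2 from the preceding row when ordered by YEAR
  mutate(var3 = var1 - dplyr::lag(var2, order_by = YEAR)) %>%
  ungroup()

print(result)
```

Note that lag takes the previous observed row within the ID. In an unbalanced panel with missing years, that row is not necessarily the previous calendar year, so gaps may need separate handling.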