I have weather data with NAs scattered throughout, and I want to calculate rolling means. I have been using the rollapplyr function from zoo, but even though I include partial = TRUE, it still returns NA whenever, for example, there is an NA in one of the 30 values being averaged.
Here is the code:
weather_rolled <- weather %>%
  mutate(maxt30 = rollapplyr(max_temp, 30, mean, partial = TRUE))
Here's my data:
A tibble: 7,160 x 11
station_name date max_temp avg_temp min_temp rainfall rh avg_wind_speed dew_point avg_bare_soil_temp total_solar_rad
<chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 VEGREVILLE 2019-01-01 0.9 -7.9 -16.6 1 81.7 20.2 -7.67 NA NA
2 VEGREVILLE 2019-01-02 5.5 1.5 -2.5 0 74.9 13.5 -1.57 NA NA
3 VEGREVILLE 2019-01-03 3.3 -0.9 -5 0.5 80.6 10.1 -3.18 NA NA
4 VEGREVILLE 2019-01-04 -1.1 -4.7 -8.2 5.2 92.1 8.67 -4.76 NA NA
5 VEGREVILLE 2019-01-05 -3.8 -6.5 -9.2 0.2 92.6 14.3 -6.81 NA NA
6 VEGREVILLE 2019-01-06 -3 -4.4 -5.9 0 91.1 16.2 -5.72 NA NA
7 VEGREVILLE 2019-01-07 -5.8 -12.2 -18.5 0 75.5 30.6 -16.9 NA NA
8 VEGREVILLE 2019-01-08 -17.4 -21.6 -25.7 1.2 67.8 16.1 -26.1 NA NA
9 VEGREVILLE 2019-01-09 -12.9 -15.1 -17.4 0.2 71.5 14.3 -17.7 NA NA
10 VEGREVILLE 2019-01-10 -13.2 -17.9 -22.5 0.4 80.2 3.38 -21.8 NA NA
# ... with 7,150 more rows
Essentially, whenever an NA appears partway through, it produces a long run of NAs in the rolling mean. I still want to calculate the rolling mean within that window, ignoring the NAs. Does anyone know a way around this? I have been searching online for hours to no avail.
Thanks!
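One commonly suggested fix (not part of the original post, so treat it as an assumption about the intent) is to pass na.rm = TRUE through rollapplyr's ... argument so that mean() ignores the NAs inside each window:

library(dplyr)
library(zoo)

# Sketch: extra arguments to rollapplyr() are forwarded to the summary function,
# so mean() can be told to drop NAs within each 30-value window.
weather_rolled <- weather %>%
  mutate(maxt30 = rollapplyr(max_temp, 30, mean, na.rm = TRUE, partial = TRUE))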
In package "openair",i want to use 'importAURN' find all 2021 AURN site data in one dataset.
i.e. merge all site data,or have other method find all 2021 AURN site data?
How can i do it.
This code can know all aurn site
importMeta(source = "aurn", all = FALSE)
Each site is imported by its code. For example, site "kc1":
kc1 <- importAURN(site = "kc1", year = 2021)
date pm2.5 site code
1 2021-01-01 00:00:00 30.4 London N. Kensington KC1
2 2021-01-01 01:00:00 55.8 London N. Kensington KC1
3 2021-01-01 02:00:00 28.3 London N. Kensington KC1
4 2021-01-01 03:00:00 15.6 London N. Kensington KC1
5 2021-01-01 04:00:00 19.8 London N. Kensington KC1
And site "AH":
AH <- importAURN(site = "AH", year = 2021)
date pm2.5 site code
1 2021-01-01 00:00:00 5.33 Aberdeen ABD
2 2021-01-01 01:00:00 3.07 Aberdeen ABD
3 2021-01-01 02:00:00 2.64 Aberdeen ABD
4 2021-01-01 03:00:00 2.43 Aberdeen ABD
5 2021-01-01 04:00:00 2.38 Aberdeen ABD
Maybe this serves your purpose:
dat <- importMeta(source = "aurn", all = FALSE)
imported <- lapply(dat$code, importAURN, year = 2021)
This first stores the metadata for all AURN sites in dat, then applies the importAURN function to each site code for the year 2021 and stores the results in a list named imported. Each element of the list contains the data for one site.
If you want to merge the data from all elements of the imported list, you can use rbind like this:
merged <- do.call(rbind, imported)
For example, to import and then merge the data from the first two sites:
first2sites <- lapply(dat$code[1:2], importAURN, year = 2021)
merged2 <- do.call(rbind, first2sites)
head(merged2)
# # A tibble: 6 x 12
# site code date nox no2 no o3 pm10 pm2.5 ws wd air_temp
# <chr> <fct> <dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 Aberdeen ABD 2021-01-01 00:00:00 4.51 3.41 0.718 58.5 8.7 5.33 6 338. 3.9
# 2 Aberdeen ABD 2021-01-01 01:00:00 3.76 2.79 0.628 57.3 6 3.07 6.1 339. 3.9
# 3 Aberdeen ABD 2021-01-01 02:00:00 3.69 2.66 0.673 54.9 5.08 2.64 5.9 341. 4.4
# 4 Aberdeen ABD 2021-01-01 03:00:00 1.54 0.815 0.471 55.2 4.78 2.43 6.9 345. 4.1
# 5 Aberdeen ABD 2021-01-01 04:00:00 3.07 2.15 0.605 52.2 5.03 2.38 7 347. 3.7
# 6 Aberdeen ABD 2021-01-01 05:00:00 3.94 3.02 0.605 49.9 6.32 2.81 6.9 352. 3.5
tail(merged2)
# # A tibble: 6 x 12
# site code date nox no2 no o3 pm10 pm2.5 ws wd air_temp
# <chr> <fct> <dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 Aberdeen Erroll Park ABD9 2021-12-31 18:00:00 175. 57.8 76.2 2.39 21.1 16.2 3.4 127. 5.4
# 2 Aberdeen Erroll Park ABD9 2021-12-31 19:00:00 143. 53.7 58.4 2.00 30.4 25.4 3.6 156. 5.9
# 3 Aberdeen Erroll Park ABD9 2021-12-31 20:00:00 175. 53.9 79.1 2.39 45.2 27.2 3.8 167. 6.4
# 4 Aberdeen Erroll Park ABD9 2021-12-31 21:00:00 177. 53.0 81.1 2.79 61.5 42.4 3.9 189. 6.9
# 5 Aberdeen Erroll Park ABD9 2021-12-31 22:00:00 215. 56.2 104. 2.79 41.0 29.6 4.4 194. 7.4
# 6 Aberdeen Erroll Park ABD9 2021-12-31 23:00:00 160. 43.7 75.9 8.98 25.3 20.6 5.9 200. 7.6
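One caveat not covered above: if the per-site tibbles do not all share the same columns (different sites report different pollutants), do.call(rbind, ...) will fail with a column-mismatch error. A minimal alternative sketch, using dplyr::bind_rows, which pads missing columns with NA:

library(dplyr)

# bind_rows() fills columns missing from some sites with NA,
# whereas rbind() requires identical columns in every element.
merged <- bind_rows(imported)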
I may have a strange question... I have a dataframe as below:
Station Mean_length Diff
1 AMEL 28.1 -2.91
2 AMRU 21.1 -9.90
3 BALG 31.0 0
4 BORK 30.1 -0.921
5 BUSU 22.6 -8.38
6 CADZ 28.5 2.46
7 DOLL 27.9 -3.07
8 EGMO 28.3 -2.69
9 EIER 30.8 0.233
10 FANO 23.1 -7.89
Now, from the column "Diff", I want to create a new column in which -0.5 is added to values below 0 and 0.5 is added to values above 0.
So I get a new dataframe like this:
Station Mean_length Diff Diff05
1 AMEL 28.1 -2.91 -3.41 (-0.5)
2 AMRU 21.1 -9.90 -10.4 (-0.5)
3 BALG 31.0 0 0.5 (+0.5)
4 BORK 30.1 -0.921 -1.421 (-0.5)
5 BUSU 22.6 -8.38 -8.88 (-0.5)
6 CADZ 28.5 2.46 2.96 (+0.5)
7 DOLL 27.9 -3.07 -3.57 (-0.5)
8 EGMO 28.3 -2.69 -3.19 (-0.5)
9 EIER 30.8 0.233 0.733 (+0.5)
10 FANO 23.1 -7.89 -8.39 (-0.5)
How can I tackle this? Is this possible in dplyr, perhaps with the 'ifelse' function, by recognizing values that have a '-' in front of them?
Thank you in advance!
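One straightforward option (a sketch, assuming the 0 row should get +0.5 as in your expected output) uses dplyr::case_when:

library(dplyr)

df <- df %>%
  mutate(Diff05 = case_when(
    Diff < 0 ~ Diff - 0.5,   # negative values get -0.5
    TRUE     ~ Diff + 0.5    # zero and positive values get +0.5
  ))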
Another way:
df$Diff05 <- df$Diff + 0.5 * sign(df$Diff)
Station Mean_length Diff Diff05
1 AMEL 28.1 -2.910 -3.410
2 AMRU 21.1 -9.900 -10.400
3 BALG 31.0 0.000 0.000
4 BORK 30.1 -0.921 -1.421
5 BUSU 22.6 -8.380 -8.880
6 CADZ 28.5 2.460 2.960
7 DOLL 27.9 -3.070 -3.570
8 EGMO 28.3 -2.690 -3.190
9 EIER 30.8 0.233 0.733
10 FANO 23.1 -7.890 -8.390
You could also use df$Diff + (df$Diff > 0) - 0.5. Note that this maps 0 to -0.5, and the sign() approach above leaves 0 unchanged, whereas the expected output maps 0 to +0.5.
Does this work?
library(dplyr)
df %>% mutate(Diff05 = if_else(Diff < 0, Diff - 0.5, Diff + 0.5))
# A tibble: 10 x 4
station Mean_length Diff Diff05
<chr> <dbl> <dbl> <dbl>
1 AMEL 28.1 -2.91 -3.41
2 AMRU 21.1 -9.9 -10.4
3 BALG 31 0 0.5
4 BORK 30.1 -0.921 -1.42
5 BUSU 22.6 -8.38 -8.88
6 CADZ 28.5 2.46 2.96
7 DOLL 27.9 -3.07 -3.57
8 EGMO 28.3 -2.69 -3.19
9 EIER 30.8 0.233 0.733
10 FANO 23.1 -7.89 -8.39
The logical way:
df$Diff05 <- ifelse(test = df$Diff < 0, yes = df$Diff - 0.5, no = df$Diff + 0.5)
This is my first post! I started using R about a year ago and I have learned a lot from this sub over the last few months! Thanks for all of your help so far.
Here is what I am trying to do:
• Group Data by POS
• Within each POS group, no ORG should represent more than 25% of the dataset
• If an ORG represents more than 25% of an observation (column), the value furthest from the mean should be deleted. I think this would need to loop until the data from that ORG make up less than 25% of that observation.
I am not sure how to approach this problem, as I am not too familiar with R functions. Well, I am assuming this would require a function.
Here is the sample dataset:
print(Example)
# A tibble: 18 x 13
Org Pos obv1 obv2 obv3 obv4 obv5 obv6 obv7 obv8 obv9 obv10 obv11
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 34.6 26.2 43.1 NA NA NA NA NA NA NA NA
2 2 1 18.7 15.5 23.4 NA NA NA NA NA NA NA NA
3 3 1 16.2 14.4 21.7 NA NA NA NA NA NA NA 1.32
4 3 1 20.0 15.5 23.4 NA NA 1.32 2.78 1.44 NA NA 1.89
5 3 1 2.39 16.9 24.1 NA NA 1.13 1.52 1.12 NA NA 2.78
6 3 1 24.3 15.4 24.6 NA NA 1.13 1.89 1.13 NA NA 1.51
7 6 1 16.7 16.0 23.4 0.19 NA 0.83 1.3 0.94 1.78 2.15 1.51
8 6 1 18.7 16.4 25.8 0.19 NA 1.22 1.4 0.97 1.93 2.35 1.51
9 6 1 19.3 16.4 25.8 0.19 NA 1.22 1.4 0.97 1.93 2.35 1.51
10 7 1 23.8 18.6 28.6 NA NA NA NA NA NA NA NA
11 12 2 28.8 24.4 39.7 NA NA 1.13 1.89 1.32 2.46 3.21 NA
12 13 2 24.6 19.6 29.4 0.16 NA 3.23 3.23 2.27 NA NA NA
13 14 2 18.4 15.5 24.8 NA NA 2.27 3.78 1.13 3.46 4.91 2.78
14 15 2 23.8 24.4 39.7 NA NA NA NA NA NA NA NA
15 15 2 25.8 24.4 39.7 NA NA NA NA NA NA NA NA
16 16 2 18.9 17.4 26.9 0.15 NA NA 1.89 2.99 NA NA 1.51
17 16 2 22.1 17.3 26.9 NA NA NA 2.57 0.94 NA NA 1.51
18 16 2 24.3 19.6 28.5 0.15 NA NA 1.51 1.32 NA NA 2.27
The result would look something like this:
print(Result)
# A tibble: 18 x 13
Org Pos obv1 obv2 obv3 obv4 obv5 obv6 obv7 obv8 obv9 obv10 obv11
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 34.6 26.2 43.1 NA NA NA NA NA NA NA NA
2 2 1 18.7 15.5 23.4 NA NA NA NA NA NA NA NA
3 3 1 NA NA NA NA NA NA NA NA NA NA NA
4 3 1 20.0 15.5 23.4 NA NA 1.32 2.78 1.44 NA NA NA
5 3 1 NA NA NA NA NA NA NA NA NA NA NA
6 3 1 NA NA NA NA NA NA NA NA NA NA 1.51
7 6 1 16.7 16.0 23.4 0.19 NA NA NA NA NA NA NA
8 6 1 NA NA NA NA NA 1.22 1.4 0.97 1.93 2.35 1.51
9 6 1 19.3 16.4 25.8 NA NA NA NA NA NA NA NA
10 7 1 23.8 18.6 28.6 NA NA NA NA NA NA NA NA
11 12 2 28.8 24.4 39.7 NA NA 1.13 1.89 1.32 2.46 3.21 NA
12 13 2 24.6 19.6 29.4 0.16 NA 3.23 3.23 2.27 NA NA NA
13 14 2 18.4 15.5 24.8 NA NA 2.27 3.78 1.13 3.46 4.91 2.78
14 15 2 NA NA NA NA NA NA NA NA NA NA NA
15 15 2 25.8 24.4 39.7 NA NA NA NA NA NA NA NA
16 16 2 NA NA NA NA NA NA 1.89 2.99 NA NA NA
17 16 2 22.1 17.3 26.9 NA NA NA 2.57 0.94 NA NA 1.51
18 16 2 NA NA NA NA NA NA NA NA NA NA NA
Any advice would be appreciated. Thanks!
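A possible starting point, sketched under a few assumptions (not from the original post): "more than 25% of the observation" is read as an Org's share of the non-NA values in a given obv column within a Pos group, "the mean" is that column's mean within the group, and "deleted" means set to NA. The helper trim_col is hypothetical and has not been checked against the expected Result table:

library(dplyr)

# Repeatedly blank the value furthest from the column mean for any Org whose
# share of the column's non-NA values exceeds the threshold.
trim_col <- function(x, org, threshold = 0.25) {
  org <- as.character(org)
  repeat {
    n_total <- sum(!is.na(x))
    if (n_total == 0) break
    counts <- tapply(!is.na(x), org, sum)
    over <- names(counts)[counts / n_total > threshold]
    if (length(over) == 0) break
    col_mean <- mean(x, na.rm = TRUE)
    for (o in over) {
      idx <- which(org == o & !is.na(x))
      if (length(idx) == 0) next
      worst <- idx[which.max(abs(x[idx] - col_mean))]
      x[worst] <- NA
    }
  }
  x
}

Result <- Example %>%
  group_by(Pos) %>%
  mutate(across(starts_with("obv"), ~ trim_col(.x, Org))) %>%
  ungroup()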
I have a dataframe containing multiple entries per week. It looks like this:
Week t_10 t_15 t_18 t_20 t_25 t_30
1 51.4 37.8 25.6 19.7 11.9 5.6
2 51.9 37.8 25.8 20.4 12.3 6.2
2 52.4 38.5 26.2 20.5 12.3 6.1
3 52.2 38.6 26.1 20.4 12.4 5.9
4 52.2 38.3 26.1 20.2 12.1 5.9
4 52.7 38.4 25.8 20.0 12.1 5.9
4 51.1 37.8 25.7 20.0 12.2 6.0
4 51.9 38.0 26.0 19.8 12.0 5.8
The weeks have different numbers of entries, ranging from one entry per week up to four.
I want to calculate the median of each week and output it for all of the variables (t_10 through t_30) in a new dataframe. NA cells have already been omitted from the original dataframe. I have tried different approaches with the ddply function from the plyr package, but to no avail so far.
We could use summarise_at for multiple columns
library(dplyr)
colsToKeep <- c("t_10", "t_30")
df1 %>%
  group_by(Week) %>%
  summarise_at(vars(colsToKeep), median)
# A tibble: 4 x 3
# Week t_10 t_30
# <int> <dbl> <dbl>
#1 1 51.40 5.60
#2 2 52.15 6.15
#3 3 52.20 5.90
#4 4 52.05 5.90
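Note, not part of the original answer: in recent dplyr versions summarise_at() is superseded by across(); an equivalent sketch:

library(dplyr)

df1 %>%
  group_by(Week) %>%
  summarise(across(all_of(colsToKeep), median))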
Specify the variables to keep in colsToKeep and store the input table in d:
library(tidyverse)
colsToKeep <- c("t_10", "t_30")
gather(d, variable, value, -Week) %>%
  filter(variable %in% colsToKeep) %>%
  group_by(Week, variable) %>%
  summarise(median = median(value))
# A tibble: 8 x 3
# Groups: Week [4]
Week variable median
<int> <chr> <dbl>
1 1 t_10 51.40
2 1 t_30 5.60
3 2 t_10 52.15
4 2 t_30 6.15
5 3 t_10 52.20
6 3 t_30 5.90
7 4 t_10 52.05
8 4 t_30 5.90
You can also use the aggregate function:
newdf <- aggregate(. ~ Week, data = df, FUN = median)
I would like to know how to transform rows to columns for the following dataset.
School class Avg Subavg Sub
ABC 2 25.3 17.2 Geo
ABC 2 25.3 18.2 Mat
ABC 2 25.3 20.2 Fre
ABC 3 21.2 17.2 Geo
ABC 3 21.2 18.2 Mat
ABC 3 21.2 20.2 Ger
ABC 4 16.8 17.2 Ger
ABC 4 16.8 18.2 Mat
ABC 5 20.2 20.2 Fre
Expected output would be
School class Avg Geo Mat Ger Fre
ABC 2 25.3 17.2 18.2 NA 20.2
ABC 3 21.2 17.2 18.2 20.2 NA
ABC 4 16.8 NA 18.2 17.2 NA
ABC 5 20.2 NA NA NA 20.2
I tried using the split function, but in vain.
Thanks in advance
We can use dcast
library(data.table)
dcast(setDT(df1), School + class + Avg ~ Sub, value.var = "Subavg")
# School class Avg Fre Geo Ger Mat
#1: ABC 2 25.3 20.2 17.2 NA 18.2
#2: ABC 3 21.2 NA 17.2 20.2 18.2
#3: ABC 4 16.8 NA NA 17.2 18.2
#4: ABC 5 20.2 20.2 NA NA NA
Or use spread from tidyr
library(tidyr)
spread(df1, Sub, Subavg)
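In newer versions of tidyr, spread() is superseded by pivot_wider(); an equivalent sketch:

library(tidyr)

pivot_wider(df1, names_from = Sub, values_from = Subavg)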