I am trying to shorten my code by choosing both the variable and the ranking direction (ascending or descending) dynamically. I can do this for descending order with desc, but I couldn't find an equivalent function for ascending order. Below is a reproducible example that demonstrates the problem.
Here is the sample dataset:
library(dplyr)
set.seed(100)
data <- tibble(a = runif(20, min = 0, max = 100),
b = runif(20, min = 0, max = 100),
c = runif(20, min = 0, max = 100))
Dynamically passing variable with percent rank in ascending order
current_var <- "a" # dynamic variable name
data %>%
mutate("percent_rank_{current_var}" := percent_rank(!!sym(current_var)))
#> # A tibble: 20 × 4
#> a b c percent_rank_a
#> <dbl> <dbl> <dbl> <dbl>
#> 1 30.8 53.6 33.1 0.263
#> 2 25.8 71.1 86.5 0.158
#> 3 55.2 53.8 77.8 0.684
#> 4 5.64 74.9 82.7 0
#> 5 46.9 42.0 60.3 0.526
#> 6 48.4 17.1 49.1 0.579
#> 7 81.2 77.0 78.0 0.947
#> 8 37.0 88.2 88.4 0.421
#> 9 54.7 54.9 20.8 0.632
#> 10 17.0 27.8 30.7 0.0526
#> 11 62.5 48.8 33.1 0.737
#> 12 88.2 92.9 19.9 1
#> 13 28.0 34.9 23.6 0.211
#> 14 39.8 95.4 27.5 0.474
#> 15 76.3 69.5 59.1 0.895
#> 16 66.9 88.9 25.3 0.789
#> 17 20.5 18.0 12.3 0.105
#> 18 35.8 62.9 23.0 0.316
#> 19 35.9 99.0 59.8 0.368
#> 20 69.0 13.0 21.1 0.842
Dynamically passing variable with percent rank in descending order
data %>%
mutate("percent_rank_{current_var}" := percent_rank(desc(!!sym(current_var))))
#> # A tibble: 20 × 4
#> a b c percent_rank_a
#> <dbl> <dbl> <dbl> <dbl>
#> 1 30.8 53.6 33.1 0.737
#> 2 25.8 71.1 86.5 0.842
#> 3 55.2 53.8 77.8 0.316
#> 4 5.64 74.9 82.7 1
#> 5 46.9 42.0 60.3 0.474
#> 6 48.4 17.1 49.1 0.421
#> 7 81.2 77.0 78.0 0.0526
#> 8 37.0 88.2 88.4 0.579
#> 9 54.7 54.9 20.8 0.368
#> 10 17.0 27.8 30.7 0.947
#> 11 62.5 48.8 33.1 0.263
#> 12 88.2 92.9 19.9 0
#> 13 28.0 34.9 23.6 0.789
#> 14 39.8 95.4 27.5 0.526
#> 15 76.3 69.5 59.1 0.105
#> 16 66.9 88.9 25.3 0.211
#> 17 20.5 18.0 12.3 0.895
#> 18 35.8 62.9 23.0 0.684
#> 19 35.9 99.0 59.8 0.632
#> 20 69.0 13.0 21.1 0.158
How can I combine both into one statement? I can do it for desc, but I couldn't find an explicit function for ascending order.
rank_function <- desc # dynamic function for ranking
data %>%
mutate("percent_rank_{current_var}" := percent_rank(rank_function(!!sym(current_var))))
#> # A tibble: 20 × 4
#> a b c percent_rank_a
#> <dbl> <dbl> <dbl> <dbl>
#> 1 30.8 53.6 33.1 0.737
#> 2 25.8 71.1 86.5 0.842
#> 3 55.2 53.8 77.8 0.316
#> 4 5.64 74.9 82.7 1
#> 5 46.9 42.0 60.3 0.474
#> 6 48.4 17.1 49.1 0.421
#> 7 81.2 77.0 78.0 0.0526
#> 8 37.0 88.2 88.4 0.579
#> 9 54.7 54.9 20.8 0.368
#> 10 17.0 27.8 30.7 0.947
#> 11 62.5 48.8 33.1 0.263
#> 12 88.2 92.9 19.9 0
#> 13 28.0 34.9 23.6 0.789
#> 14 39.8 95.4 27.5 0.526
#> 15 76.3 69.5 59.1 0.105
#> 16 66.9 88.9 25.3 0.211
#> 17 20.5 18.0 12.3 0.895
#> 18 35.8 62.9 23.0 0.684
#> 19 35.9 99.0 59.8 0.632
#> 20 69.0 13.0 21.1 0.158
Created on 2022-08-17 by the reprex package (v2.0.1)
You could define a function that returns its input unchanged:
rank_function <- function(x) x
In fact, base R already provides this function: identity.
rank_function <- identity
Also, you can explore the source code of desc:
desc
function (x) -xtfrm(x)
So desc(x) is simply the negation of xtfrm(x), which means you can use xtfrm itself for ascending order.
rank_function <- xtfrm
From the help page for xtfrm(x):
A generic auxiliary function that produces a numeric vector which will sort in the same order as x.
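As a minimal sketch of how the direction could be switched with a single flag: the helper name pick_rank_fn and the toy vector below are illustrative assumptions, not part of the original question.

```r
library(dplyr)

# Hypothetical helper: returns the transformation applied before ranking.
# identity leaves the values as they are (ascending); desc negates them.
pick_rank_fn <- function(descending = FALSE) {
  if (descending) desc else identity
}

x <- c(10, 30, 20)

asc <- percent_rank(pick_rank_fn(FALSE)(x))  # ranks smallest value lowest
dsc <- percent_rank(pick_rank_fn(TRUE)(x))   # ranks largest value lowest

asc
dsc
```

Inside mutate the same idea would read percent_rank(rank_function(!!sym(current_var))) with rank_function <- pick_rank_fn(TRUE) or pick_rank_fn(FALSE).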
I need to run a function I wrote on each row of a data frame and store the result in a new column of the same data frame (column name tt_daily). Here is a made-up example.
#data
data1 <- read.csv(text = "
doy,tmx,tmn,relHum,srad
148,31.3,13.8,68.3,30.4
149,31.1,17.2,62.2,30
150,30.1,16.1,69.7,20.9
151,27.3,16.2,77.1,26.1
152,33.4,18.4,65.9,27.4
153,27.2,18,70.3,26.6
154,30.3,13,71.5,28.4
155,36.2,22,62.2,28.8
156,32.9,22.2,61.1,24.9
157,30.5,16.2,63.2,27.9
158,25.7,19.3,71,18.3
159,29.1,18.3,87.2,12.7
160,28.5,20.3,70.2,24.8
")
This is the function:
# function to run row wise
tb<- 11
topt<- 30
tmax<- 42
tt <- function(tmx, tmn, tb, topt, tmax) {
  tmean <- (tmx + tmn) / 2
  if (tmean <= tb) {t1 = 0}
  if (tmean > tb & tmean <= topt) {t1 = tmean - tb}
  if (tmean > topt & tmean < max) {t1 = (topt - tb) / (topt - tmax) * (tmean - tmax)}
  if (tmean >= tmax) {t1 <- 0}
  return(t1)
}
These are the two options I tried:
#Option 1
library(dplyr)
tt.example <- data1 %>%
mutate(tt_daily = purrr::pmap(function(tmx, tmn, tb, topt, tmax) tt))
and this is the error:
Error: Problem with mutate() column tt_daily.
i tt_daily = purrr::pmap(function(tmx, tmn, tb, topt, tmax) tt).
x argument ".f" is missing, with no default
This is the option 2:
#Option 2
tt.example <- data1 %>%
rowwise() %>%
mutate(tt_daily = tt(tmx, tmn, tb, topt, tmax))
This is the error I got:
Error: Problem with mutate() column tt_daily.
i tt_daily = tt(tmx, tmn, tb, topt, tmax).
x comparison (3) is possible only for atomic and list types
i The error occurred in row 1.
Thanks for any advice.
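There is a typo in the function: the third condition compares tmean with max (the base R function) instead of tmax. Comparing a number with a function is what triggers the "comparison (3) is possible only for atomic and list types" error. The corrected function is: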
tt <- function(tmx, tmn, tb, topt, tmax) {
  tmean <- (tmx + tmn) / 2
  if (tmean <= tb) {t1 <- 0}
  if (tmean > tb & tmean <= topt) {t1 <- tmean - tb}
  if (tmean > topt & tmean < tmax) {t1 <- (topt - tb) / (topt - tmax) * (tmean - tmax)}
  if (tmean >= tmax) {t1 <- 0}
  return(t1)
}
Now we apply the function inside mutate, passing the tmx and tmn columns to pmap_dbl together with the other arguments as a named list:
library(dplyr)
library(purrr)
data1 %>%
mutate(tt_daily = pmap_dbl(c(across(tmx:tmn),
dplyr::lst(tb, topt, tmax)), tt))
-output
doy tmx tmn relHum srad tt_daily
1 148 31.3 13.8 68.3 30.4 11.55
2 149 31.1 17.2 62.2 30.0 13.15
3 150 30.1 16.1 69.7 20.9 12.10
4 151 27.3 16.2 77.1 26.1 10.75
5 152 33.4 18.4 65.9 27.4 14.90
6 153 27.2 18.0 70.3 26.6 11.60
7 154 30.3 13.0 71.5 28.4 10.65
8 155 36.2 22.0 62.2 28.8 18.10
9 156 32.9 22.2 61.1 24.9 16.55
10 157 30.5 16.2 63.2 27.9 12.35
11 158 25.7 19.3 71.0 18.3 11.50
12 159 29.1 18.3 87.2 12.7 12.70
13 160 28.5 20.3 70.2 24.8 13.40
Or using rowwise
data1 %>%
rowwise %>%
mutate(tt_daily = tt(tmx, tmn, tb, topt, tmax)) %>%
ungroup
-output
# A tibble: 13 x 6
doy tmx tmn relHum srad tt_daily
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 148 31.3 13.8 68.3 30.4 11.6
2 149 31.1 17.2 62.2 30 13.2
3 150 30.1 16.1 69.7 20.9 12.1
4 151 27.3 16.2 77.1 26.1 10.8
5 152 33.4 18.4 65.9 27.4 14.9
6 153 27.2 18 70.3 26.6 11.6
7 154 30.3 13 71.5 28.4 10.6
8 155 36.2 22 62.2 28.8 18.1
9 156 32.9 22.2 61.1 24.9 16.5
10 157 30.5 16.2 63.2 27.9 12.4
11 158 25.7 19.3 71 18.3 11.5
12 159 29.1 18.3 87.2 12.7 12.7
13 160 28.5 20.3 70.2 24.8 13.4
If we want to add more than one new column, it may be better to return a list or a tibble from the 'tt' function:
tt <- function(tmx, tmn, tb, topt, tmax) {
  tmean <- (tmx + tmn) / 2
  if (tmean <= tb) {t1 <- 0}
  if (tmean > tb & tmean <= topt) {t1 <- tmean - tb}
  if (tmean > topt & tmean < tmax) {t1 <- (topt - tb) / (topt - tmax) * (tmean - tmax)}
  if (tmean >= tmax) {t1 <- 0}
  return(tibble(tt_daily = t1, tmean = tmean))
}
Now, we wrap the contents in a list and unnest the output column
library(tidyr)
data1 %>%
rowwise %>%
mutate(out = list(tt(tmx, tmn, tb, topt, tmax))) %>%
ungroup %>%
unnest_wider(c(out))
# A tibble: 13 x 7
doy tmx tmn relHum srad tt_daily tmean
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 148 31.3 13.8 68.3 30.4 11.6 22.6
2 149 31.1 17.2 62.2 30 13.2 24.2
3 150 30.1 16.1 69.7 20.9 12.1 23.1
4 151 27.3 16.2 77.1 26.1 10.8 21.8
5 152 33.4 18.4 65.9 27.4 14.9 25.9
6 153 27.2 18 70.3 26.6 11.6 22.6
7 154 30.3 13 71.5 28.4 10.6 21.6
8 155 36.2 22 62.2 28.8 18.1 29.1
9 156 32.9 22.2 61.1 24.9 16.5 27.6
10 157 30.5 16.2 63.2 27.9 12.4 23.4
11 158 25.7 19.3 71 18.3 11.5 22.5
12 159 29.1 18.3 87.2 12.7 12.7 23.7
13 160 28.5 20.3 70.2 24.8 13.4 24.4
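As an alternative sketch, the row-wise calls could be avoided altogether by making the function vectorized with dplyr::case_when. The name tt_vec is hypothetical, and this rewrite assumes the four branches of the original function are meant to be checked in order and to cover every case.

```r
library(dplyr)

# Hypothetical vectorized variant of tt(): works on whole columns at once.
# case_when evaluates the conditions in order, so each later branch only
# applies to values that did not match an earlier one.
tt_vec <- function(tmx, tmn, tb, topt, tmax) {
  tmean <- (tmx + tmn) / 2
  case_when(
    tmean <= tb   ~ 0,
    tmean <= topt ~ tmean - tb,
    tmean < tmax  ~ (topt - tb) / (topt - tmax) * (tmean - tmax),
    TRUE          ~ 0
  )
}

# Three made-up rows: below tb, between topt and tmax, above tmax
example_tt <- tt_vec(tmx = c(5, 40, 50), tmn = c(5, 32, 50),
                     tb = 11, topt = 30, tmax = 42)
example_tt
```

With this version, data1 %>% mutate(tt_daily = tt_vec(tmx, tmn, tb, topt, tmax)) should produce the same tt_daily values without rowwise or pmap.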
I want to store the output of veh.velocity from each iteration and combine it into a new data frame y. I understand it is best to set up an empty data frame first and then combine the columns of data at the end. Different iterations also have different numbers of rows. Is it possible to consider just the first 20? I apologize if there are several issues and misconceptions below; I only started programming a few months ago. Thanks.
CSV File: https://drive.google.com/file/d/1tMOz_yM-WenSOlF3UK6UatniwtFI7kzf/view?usp=sharing
#Time series
#This programme evaluates each vehicles speed behaviour w.r.t time.
library(ggplot2)
library(fpc)
library(factoextra)
library(readr)
library(plotly)
library(dplyr)
library(fpp2)
#Clear all variables in workspace
rm(list=ls())
#Importing data
df <- read_csv("01_tracks.csv")
#Preparing data
df1 <- filter(df,laneId == 5, width <= 6) #Filtering to only lane 5 and no trucks
#Creating empty lists
y <- data.frame()
#Loop to plot time series for only filtered vehicle id's
for(i in unique(df1$id)[1:6]) { #Only considering first 6 vehicles for now due to long computation time
print(i) #List of vehicle id's
veh <- filter(df1,id == i) #New dataframe for vehicles/id's which are in lanes 5
timeseries <- ts(veh[,7],start = 1) #Declare as time series data
plot(autoplot(timeseries) + ggtitle(i) + ylab("X Velocity")) #Plotting time series
veh.velocity <- select(veh,xVelocity) #New dataframe for only vehicle id and its velocity
y <- cbind.data.frame(y,veh.velocity)
}
If you just want to plot a time series of the individual vehicle IDs, I'm not sure there's a reason to use a loop. For example, you could easily make a time variable and plot with ggplot2:
library(dplyr)
library(ggplot2)
df1 %>%
group_by(id) %>%
mutate(time = 1:n()) %>%
ggplot(aes(x = time, y = xVelocity, color = as.factor(id))) +
geom_line(show.legend = FALSE, alpha = 0.5)
And you could use pivot_wider from tidyr to reshape the data:
library(tidyr)
result <- df1 %>%
group_by(id) %>%
mutate(time = 1:n()) %>%
dplyr::select(time, xVelocity) %>%
pivot_wider(id_cols = time, values_from = xVelocity,
names_prefix = "Veh.", names_from = id)
result
# A tibble: 414 x 287
time Veh.1 Veh.3 Veh.7 Veh.11 Veh.12 Veh.14 Veh.15 Veh.25 Veh.31 Veh.47 Veh.50 Veh.53 Veh.55 Veh.59
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 40.8 35.7 32.6 24.7 35.8 41.5 35.7 37.9 39.2 39.2 40.3 39.5 33.0 38.2
2 2 40.9 35.7 32.6 24.8 35.8 41.5 35.7 37.9 39.2 39.2 40.3 39.5 33.0 38.2
3 3 40.9 35.7 32.6 24.8 35.8 41.5 35.7 38.0 39.2 39.2 40.3 39.6 33.0 38.2
4 4 40.9 35.7 32.6 24.8 35.8 41.5 35.7 38.0 39.2 39.2 40.3 39.6 33.1 38.2
5 5 40.9 35.7 32.6 24.8 35.8 41.5 35.7 38.0 39.2 39.2 40.3 39.6 33.1 38.1
6 6 40.9 35.7 32.6 24.9 35.8 41.6 35.7 38 39.2 39.2 40.4 39.6 33.1 38.1
7 7 40.9 35.7 32.7 24.9 35.8 41.6 35.7 38.0 39.2 39.2 40.4 39.6 33.1 38.1
8 8 40.9 35.7 32.7 24.9 35.8 41.6 35.7 38.0 39.2 39.3 40.4 39.6 33.1 38.1
9 9 41.0 35.8 32.7 25.0 35.8 41.6 35.7 38.0 39.2 39.3 40.4 39.6 33.1 38.1
10 10 41.0 35.8 32.7 25 35.8 41.6 35.7 38.1 39.2 39.3 40.4 39.7 33.1 38.1
# … with 404 more rows, and 272 more variables:
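To keep only the first 20 time steps per vehicle, as the question asks, one option is to filter on the per-vehicle time index before reshaping. The sketch below uses a small made-up tibble (toy) in place of df1 and keeps the first 2 rows per vehicle so the result is easy to inspect; the values are illustrative only.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for df1: two vehicles with different numbers of rows
toy <- tibble(
  id        = c(1, 1, 1, 2, 2),
  xVelocity = c(40.8, 40.9, 40.9, 35.7, 35.7)
)

n_keep <- 2  # would be 20 for the real data

wide <- toy %>%
  group_by(id) %>%
  mutate(time = row_number()) %>%   # per-vehicle time index
  filter(time <= n_keep) %>%        # keep only the first n_keep rows per vehicle
  ungroup() %>%
  pivot_wider(id_cols = time, names_from = id,
              values_from = xVelocity, names_prefix = "Veh.")

wide
```

Vehicles with fewer than n_keep rows would simply get NA in the remaining rows of their column.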
I am trying to calculate SPEI values using SPEI package and Hargreaves method. I want to automate the process so that I can calculate SPEI for all 6 stations in one go and save them to a new file spei.3.
SPEI is calculated in three steps. First, we calculate the PET values (spei_pet), which are then subtracted from the precipitation values to get the climatic water balance (spei_cwbal). The water balance is then passed, together with a scale, to the spei function from the package of the same name to calculate the SPEI values.
I am new to R and very new to the tidyverse, but I have read that it is easier to work with. I wrote the code below for this task, but I am clearly missing something (or perhaps many things), because the code throws an error. Please help me identify the error in my code and find a solution.
library(tidyverse)
library(SPEI)
file_path = "I:/Proj/Excel sheets - climate/SPI/heatmap/spei_forecast_data.xlsx"
file_forecast = openxlsx::read.xlsx(file_path)
##spei calculation
spei.scale = c(3, 6, 9, 12, 15, 24)
stations = c(1:3, 5:7)
lat = c(23.29, 23.08, 22.95, 22.62, 22.43, 22.40)
lat.fn = function(i) {
if (i <= 3)
lat.fn = lat[i]
else if (i == 5)
lat.fn = lat[4]
else if (i == 6)
lat.fn = lat[5]
else if (i == 7)
lat.fn = lat[6]
}
for ( i in stations) {
file_forecast %>%
mutate(spei_pet[i] <- hargreaves(Tmin = file_forecast$paste("tmin", i),
Tmax = file_forecast$paste("tmax", i),
Pre = file_forecast$paste("p", i),
lat = lat.fn[i])) %>%
mutate(spei_cwbal[i] <- spei_pet[[i]] - file_forecast$paste("p", i)) %>%
mutate(spei.3[i] <- spei(spei_cwbal[[i]], scale = 3))
}
It throws an error
Error in as.matrix(Tmin) : attempt to apply non-function
lat.fn[i] also throws an error, which goes away if I drop the [i]. But I need some kind of function so that lat.fn takes a different value depending on i.
Error in lat.fn[i] : object of type 'closure' is not subsettable
Thanks.
Edit: The data is in the form of a data.frame. I converted it into a tibble to give an idea of what it looks like.
> file_forecast
# A tibble: 960 x 20
Month p7 p6 p5 p3 p2 p1 tmax7 tmax6 tmax5 tmax3 tmax2 tmax1 tmin7 tmin6 tmin5 tmin3 tmin2 tmin1
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Jan 0.162 0.185 0.293 0.436 0.529 0.658 26.4 26.5 26.2 25.9 25.7 24.9 9.57 9.75 10.0 10.4 9.94 9.77
2 Feb 0.207 0.305 0.250 0.260 0.240 0.186 32.2 32.2 32.1 31.9 31.8 30.9 12.4 12.7 12.7 13.0 12.2 11.9
3 Mar 0.511 0.650 0.602 0.636 0.625 0.501 37.3 37.1 37.1 37.0 36.9 36.1 18.7 19.3 18.3 18.0 17.3 16.9
4 Apr 0.976 1.12 1.05 1.12 1.17 1.16 39.5 39.2 39.6 39.5 39.5 38.8 22.8 23.2 22.5 22.2 21.7 20.8
5 May 3.86 4.12 3.76 4.29 4.15 3.84 38.2 37.9 38.3 38.1 38.2 37.6 25.1 25.4 24.9 24.7 24.5 23.8
6 Jun 7.31 8.27 7.20 8.51 9.14 8.76 38.0 37.6 38.1 38.0 38.0 37.7 27.2 27.3 26.9 26.7 26.6 26.1
7 Jul 13.9 15.6 13.2 17.0 19.1 17.8 33.9 33.6 34.0 33.9 33.8 33.5 26.8 26.9 26.6 26.5 26.4 26.0
8 Aug 15.2 17.2 14.4 18.6 20.1 18.4 32.6 32.4 32.7 32.4 32.3 32.0 26.2 26.4 26.1 25.9 25.9 25.4
9 Sep 11.4 11.9 10.5 12.9 13.2 13.1 31.9 31.9 31.8 31.5 31.5 30.9 24.4 24.6 24.3 24.3 24.3 23.7
10 Oct 5.19 5.76 4.81 5.40 5.44 5.04 29.8 30.0 29.6 29.3 29.3 28.6 20.9 21.1 20.8 20.9 20.8 20.2
# ... with 950 more rows, and 1 more variable: year <dbl>
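Both errors above come from how names are looked up. file_forecast$paste("tmin", i) tries to call an element of the data frame named paste, which does not exist, hence "attempt to apply non-function"; and lat.fn[i] subsets a function instead of calling it as lat.fn(i). Note also that paste("tmin", i) yields "tmin 1" with a space, whereas the columns are named like tmin1. A minimal sketch of the lookups in base R follows; the helpers lat_for and col_for and the two-row toy data frame are hypothetical, and the hargreaves and spei calls are deliberately left out.

```r
stations <- c(1:3, 5:7)
lat <- c(23.29, 23.08, 22.95, 22.62, 22.43, 22.40)

# Named lookup: station number (as a string) -> latitude
lat_by_station <- setNames(lat, stations)

# Hypothetical helper: latitude for station i
lat_for <- function(i) unname(lat_by_station[as.character(i)])

# Hypothetical helper: pull a column whose name is built from a prefix and i.
# [[ ]] accepts a computed string, unlike $; paste0 avoids the space.
col_for <- function(df, prefix, i) df[[paste0(prefix, i)]]

# Toy stand-in for file_forecast (illustrative values only)
toy <- data.frame(tmin1 = c(9.77, 11.9), tmin5 = c(10.0, 12.7))

lat_for(5)
col_for(toy, "tmin", 5)
```

Inside the loop, the Tmin, Tmax and precipitation arguments could then be built with col_for(file_forecast, "tmin", i) and so on, and the latitude with lat_for(i).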
I'm trying to get the mean for each sub-dataset in my dataset, but my output gives the mean of the whole dataset for every sub-dataset. I think it might be an issue with the way my dataset is structured: the data consists of x and y observations for 13 sub-datasets with the following names: dino, away, h_lines, v_lines, x_shape, star, high_lines, dots, circle, bullseye, slant_up, slant_down, wide_lines. The sub-dataset names are listed in a column called "dataset" (see example picture below).
[image: dataset snippet]
I'm using the dplyr functions group_by() and summarize(). I've seen so many examples where this works, so I'm not sure where I'm going wrong.
This is what I've tried
dinodata%>%
dplyr::group_by(dataset)%>%
dplyr::summarize(mean_x = mean(x),
mean_y = mean(y),
sd_x = sd(x),
sd_y = sd(y),
correlation = cor(x,y)
)
and this is the output
# A tibble: 13 x 6
dataset mean_x mean_y sd_x sd_y correlation
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 away 54.3 47.8 16.8 26.9 -0.0641
2 bullseye 54.3 47.8 16.8 26.9 -0.0686
3 circle 54.3 47.8 16.8 26.9 -0.0683
4 dino 54.3 47.8 16.8 26.9 -0.0645
5 dots 54.3 47.8 16.8 26.9 -0.0603
6 h_lines 54.3 47.8 16.8 26.9 -0.0617
7 high_lines 54.3 47.8 16.8 26.9 -0.0685
8 slant_down 54.3 47.8 16.8 26.9 -0.0690
9 slant_up 54.3 47.8 16.8 26.9 -0.0686
10 star 54.3 47.8 16.8 26.9 -0.0630
11 v_lines 54.3 47.8 16.8 26.9 -0.0694
12 wide_lines 54.3 47.8 16.8 26.9 -0.0666
13 x_shape 54.3 47.8 16.8 26.9 -0.0656
The means and standard deviations come out the same as mean(dinodata$x) and sd(dinodata$x), which is not what I want. I want the mean, standard deviation, and so on of x and y for each sub-dataset.
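One thing worth checking, assuming dinodata is the Datasaurus Dozen (the 13 names match): those sub-datasets are constructed to share nearly identical summary statistics, and a tibble prints only about three significant digits, so per-group means that really differ can look the same. Note that the correlation column above does vary by group, which suggests the grouping is working. The sketch below uses made-up numbers (toy) to show the effect.

```r
library(dplyr)

# Toy data: two groups whose means differ only in the second decimal place
toy <- tibble(
  dataset = rep(c("a", "b"), each = 3),
  x       = c(54.26, 54.27, 54.28, 54.25, 54.27, 54.27)
)

res <- toy %>%
  group_by(dataset) %>%
  summarize(mean_x = mean(x))

res                  # the tibble print shows both means rounded alike
as.data.frame(res)   # a plain data frame prints more digits
```

Setting options(pillar.sigfig = 6) is another way to make tibbles print more significant digits.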