I'm trying to get the mean for each sub-dataset in my dataset, but my output just gives me the mean of the whole dataset for every sub-dataset. I think it might be an issue with the way my dataset is structured: the data consists of x and y observations for 13 sub-datasets with the following names: dino, away, h_lines, v_lines, x_shape, star, high_lines, dots, circle, bullseye, slant_up, slant_down, wide_lines. The sub-dataset names are listed in a column called "dataset" (see example picture below).
[Image: dataset snippet]
I'm using the dplyr functions group_by() and summarize(). I've seen so many examples where this works, so I'm not sure where I'm going wrong.
This is what I've tried
dinodata%>%
dplyr::group_by(dataset)%>%
dplyr::summarize(mean_x = mean(x),
mean_y = mean(y),
sd_x = sd(x),
sd_y = sd(y),
correlation = cor(x,y)
)
and this is the output
# A tibble: 13 x 6
dataset mean_x mean_y sd_x sd_y correlation
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 away 54.3 47.8 16.8 26.9 -0.0641
2 bullseye 54.3 47.8 16.8 26.9 -0.0686
3 circle 54.3 47.8 16.8 26.9 -0.0683
4 dino 54.3 47.8 16.8 26.9 -0.0645
5 dots 54.3 47.8 16.8 26.9 -0.0603
6 h_lines 54.3 47.8 16.8 26.9 -0.0617
7 high_lines 54.3 47.8 16.8 26.9 -0.0685
8 slant_down 54.3 47.8 16.8 26.9 -0.0690
9 slant_up 54.3 47.8 16.8 26.9 -0.0686
10 star 54.3 47.8 16.8 26.9 -0.0630
11 v_lines 54.3 47.8 16.8 26.9 -0.0694
12 wide_lines 54.3 47.8 16.8 26.9 -0.0666
13 x_shape 54.3 47.8 16.8 26.9 -0.0656
The means and standard deviations are calculating the same as if I did mean(dinodata$x) and sd(dinodata$x) which is not what I want. I want the mean for each sub-dataset for x and y, etc.
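One thing worth checking before suspecting the code: the correlation column in the output already differs from row to row, which suggests the grouping is taking effect. The 13 names match the Datasaurus Dozen, a collection built so that every sub-dataset shares nearly the same means and standard deviations. If that is where dinodata comes from (an assumption), the identical-looking values would be expected and would differ only beyond the digits a tibble prints. A minimal sketch with made-up toy data, showing a grouped summary and one way to print more digits:

```r
library(dplyr)

# Toy data (made up): two groups whose means are nearly, but not exactly, equal
toy <- data.frame(
  dataset = rep(c("a", "b"), each = 4),
  x = c(1, 2, 3, 4.001, 1, 2, 3, 4.002)
)

res <- toy %>%
  group_by(dataset) %>%
  summarize(mean_x = mean(x), sd_x = sd(x))

# A tibble prints about 3 significant digits by default,
# so the two rows look identical here
print(res)

# A plain data frame printed with more digits reveals the difference
print(as.data.frame(res), digits = 8)
```

Running the same two print calls on the summary of dinodata should show whether the per-group values really are identical or only look that way after rounding.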
I tried to convert a dataset to long format, but the transformed data frame comes out the wrong way round: the set names end up as column names and x and y end up as values. I don't know how to fix this. I want x and y to be the column names. Any help?
library(pacman)
p_load(tidyverse, purrr, datasauRus)
datasaurus_dozen_wide %>%
pivot_longer(everything(),
names_to = c(".value", "set"),
names_pattern = "(.*)_(.)")
#> # A tibble: 284 × 14
#> set away bullseye circle dino dots h_lines high_…¹ slant…² slant…³ star
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 x 32.3 51.2 56.0 55.4 51.1 53.4 57.6 52.9 47.7 58.2
#> 2 y 61.4 83.3 79.3 97.2 90.9 90.2 83.9 97.3 95.2 91.9
#> 3 x 53.4 59.0 50.0 51.5 50.5 52.8 51.3 59.0 44.6 58.2
#> 4 y 26.2 85.5 79.0 96.0 89.1 90.1 82.8 93.6 93.1 92.2
#> 5 x 63.9 51.9 51.3 46.2 50.2 47.1 50.8 56.4 43.9 58.7
#> 6 y 30.8 85.8 82.4 94.5 85.5 90.5 76.8 96.3 94.1 90.3
#> 7 x 70.3 48.2 51.2 42.8 50.1 42.4 37.0 37.8 41.6 57.3
#> 8 y 82.5 85.0 79.2 91.4 83.1 89.5 82.0 94.4 90.3 89.9
#> 9 x 34.1 41.7 44.4 40.8 50.6 42.7 42.9 39.9 49.2 58.1
#> 10 y 45.7 84.0 78.2 88.3 82.9 90.4 80.2 90.6 96.6 92.0
#> # … with 274 more rows, 3 more variables: v_lines <dbl>, wide_lines <dbl>,
#> # x_shape <dbl>, and abbreviated variable names ¹high_lines, ²slant_down,
#> # ³slant_up
Created on 2022-10-10 with reprex v2.0.2
You could achieve your desired result by simply switching ".value" and "set" in the names_to argument:
library(tidyr)
library(datasauRus)
datasaurus_dozen_long <- datasaurus_dozen_wide %>%
pivot_longer(everything(),
names_to = c("set", ".value"),
names_pattern = "(.*)_(.)")
head(datasaurus_dozen_long)
#> # A tibble: 6 × 3
#> set x y
#> <chr> <dbl> <dbl>
#> 1 away 32.3 61.4
#> 2 bullseye 51.2 83.3
#> 3 circle 56.0 79.3
#> 4 dino 55.4 97.2
#> 5 dots 51.1 90.9
#> 6 h_lines 53.4 90.2
library(ggplot2)
ggplot(datasaurus_dozen_long, aes(x, y)) +
geom_point() +
facet_wrap(~set)
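The reason the swap works is that each entry in names_to is matched, in order, to a capture group in names_pattern, and the special ".value" entry tells pivot_longer to turn that captured piece into column names rather than values. A small sketch on a made-up wide table that follows the same "<set>_<x|y>" naming scheme (the table and its values are invented for illustration):

```r
library(tidyr)

# Made-up wide table: two sets, each with an x and a y column
wide <- data.frame(away_x = c(1, 2), away_y = c(3, 4),
                   dino_x = c(5, 6), dino_y = c(7, 8))

# First capture group -> values of "set"; second group -> column names (x, y)
long <- pivot_longer(wide, everything(),
                     names_to = c("set", ".value"),
                     names_pattern = "(.*)_(.)")
long
```

The result has one row per original row and set, with columns set, x and y.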
I need to run a self-made function across rows and store its output in a new column of the same data frame (column name tt_daily). Here is a made-up example.
#data
data1 <- read.csv(text = "
doy,tmx,tmn,relHum,srad
148,31.3,13.8,68.3,30.4
149,31.1,17.2,62.2,30
150,30.1,16.1,69.7,20.9
151,27.3,16.2,77.1,26.1
152,33.4,18.4,65.9,27.4
153,27.2,18,70.3,26.6
154,30.3,13,71.5,28.4
155,36.2,22,62.2,28.8
156,32.9,22.2,61.1,24.9
157,30.5,16.2,63.2,27.9
158,25.7,19.3,71,18.3
159,29.1,18.3,87.2,12.7
160,28.5,20.3,70.2,24.8
")
This is the function:
# function to run row wise
tb<- 11
topt<- 30
tmax<- 42
tt<-function(tmx, tmn, tb, topt, tmax){
tmean<- (tmx + tmn) / 2
if(tmean <= tb) {t1 = 0}
if(tmean >tb & tmean <=topt) {t1 = tmean - tb}
if(tmean>topt & tmean<max) {t1 = (topt - tb) / (topt - tmax) * (tmean - tmax)}
if(tmean >= tmax) {t1 <- 0}
return(t1)
}
These are the two options I tried:
#Option 1
library(dplyr)
tt.example <- data1 %>%
mutate(tt_daily = purrr::pmap(function(tmx, tmn, tb, topt, tmax) tt))
and this is the error:
Error: Problem with mutate() column tt_daily.
i tt_daily = purrr::pmap(function(tmx, tmn, tb, topt, tmax) tt).
x argument ".f" is missing, with no default
This is the option 2:
#Option 2
tt.example <- data1 %>%
rowwise() %>%
mutate(tt_daily = tt(tmx, tmn, tb, topt, tmax))
This is the error I got:
Error: Problem with mutate() column tt_daily.
i tt_daily = tt(tmx, tmn, tb, topt, tmax).
x comparison (3) is possible only for atomic and list types
i The error occurred in row 1.
Thanks for any advice.
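There is a typo in the function: the third condition compares tmean with max (the base R function) instead of tmax, which is what triggers the "comparison (3) is possible only for atomic and list types" error. The corrected version is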
tt<-function(tmx, tmn, tb, topt, tmax){
tmean<- (tmx + tmn) / 2
if(tmean <= tb) {t1 = 0}
if(tmean >tb & tmean <=topt) {t1 = tmean - tb}
if(tmean>topt & tmean<tmax) {t1 = (topt - tb) / (topt - tmax) * (tmean - tmax)}
if(tmean >= tmax) {t1 <- 0}
return(t1)
}
Now, we apply the function within mutate after appending the other arguments as a named list within pmap
library(dplyr)
library(purrr)
data1 %>%
mutate(tt_daily = pmap_dbl(c(across(tmx:tmn),
dplyr::lst(tb, topt, tmax)), tt))
-output
doy tmx tmn relHum srad tt_daily
1 148 31.3 13.8 68.3 30.4 11.55
2 149 31.1 17.2 62.2 30.0 13.15
3 150 30.1 16.1 69.7 20.9 12.10
4 151 27.3 16.2 77.1 26.1 10.75
5 152 33.4 18.4 65.9 27.4 14.90
6 153 27.2 18.0 70.3 26.6 11.60
7 154 30.3 13.0 71.5 28.4 10.65
8 155 36.2 22.0 62.2 28.8 18.10
9 156 32.9 22.2 61.1 24.9 16.55
10 157 30.5 16.2 63.2 27.9 12.35
11 158 25.7 19.3 71.0 18.3 11.50
12 159 29.1 18.3 87.2 12.7 12.70
13 160 28.5 20.3 70.2 24.8 13.40
Or using rowwise
data1 %>%
rowwise %>%
mutate(tt_daily = tt(tmx, tmn, tb, topt, tmax)) %>%
ungroup
-output
# A tibble: 13 x 6
doy tmx tmn relHum srad tt_daily
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 148 31.3 13.8 68.3 30.4 11.6
2 149 31.1 17.2 62.2 30 13.2
3 150 30.1 16.1 69.7 20.9 12.1
4 151 27.3 16.2 77.1 26.1 10.8
5 152 33.4 18.4 65.9 27.4 14.9
6 153 27.2 18 70.3 26.6 11.6
7 154 30.3 13 71.5 28.4 10.6
8 155 36.2 22 62.2 28.8 18.1
9 156 32.9 22.2 61.1 24.9 16.5
10 157 30.5 16.2 63.2 27.9 12.4
11 158 25.7 19.3 71 18.3 11.5
12 159 29.1 18.3 87.2 12.7 12.7
13 160 28.5 20.3 70.2 24.8 13.4
If we want to add more than one new column, it may be better to return a list or tibble from the 'tt' function
tt<-function(tmx, tmn, tb, topt, tmax){
tmean<- (tmx + tmn) / 2
if(tmean <= tb) {t1 = 0}
if(tmean >tb & tmean <=topt) {t1 = tmean - tb}
if(tmean>topt & tmean<tmax) {t1 = (topt - tb) / (topt - tmax) * (tmean - tmax)}
if(tmean >= tmax) {t1 <- 0}
return(tibble(tt_daily = t1, tmean = tmean))
}
Now, we wrap the contents in a list and unnest the output column
library(tidyr)
data1 %>%
rowwise %>%
mutate(out = list(tt(tmx, tmn, tb, topt, tmax))) %>%
ungroup %>%
unnest_wider(c(out))
# A tibble: 13 x 7
doy tmx tmn relHum srad tt_daily tmean
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 148 31.3 13.8 68.3 30.4 11.6 22.6
2 149 31.1 17.2 62.2 30 13.2 24.2
3 150 30.1 16.1 69.7 20.9 12.1 23.1
4 151 27.3 16.2 77.1 26.1 10.8 21.8
5 152 33.4 18.4 65.9 27.4 14.9 25.9
6 153 27.2 18 70.3 26.6 11.6 22.6
7 154 30.3 13 71.5 28.4 10.6 21.6
8 155 36.2 22 62.2 28.8 18.1 29.1
9 156 32.9 22.2 61.1 24.9 16.5 27.6
10 157 30.5 16.2 63.2 27.9 12.4 23.4
11 158 25.7 19.3 71 18.3 11.5 22.5
12 159 29.1 18.3 87.2 12.7 12.7 23.7
13 160 28.5 20.3 70.2 24.8 13.4 24.4
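As an alternative to pmap and rowwise, the same piecewise rule can be written so that it works on whole columns at once. The sketch below uses dplyr::case_when, which evaluates the conditions in order and takes the first match; the name tt_vec is hypothetical and the two-row data frame is just the first two rows of the example data.

```r
library(dplyr)

tb <- 11
topt <- 30
tmax <- 42

# Vectorised variant of tt (hypothetical name): accepts whole columns
tt_vec <- function(tmx, tmn, tb, topt, tmax) {
  tmean <- (tmx + tmn) / 2
  case_when(
    tmean <= tb   ~ 0,                                            # too cold
    tmean <= topt ~ tmean - tb,                                   # between tb and topt
    tmean <  tmax ~ (topt - tb) / (topt - tmax) * (tmean - tmax), # between topt and tmax
    TRUE          ~ 0                                             # at or above tmax
  )
}

# First two rows of the example data
data1 <- data.frame(tmx = c(31.3, 31.1), tmn = c(13.8, 17.2))

out <- data1 %>%
  mutate(tt_daily = tt_vec(tmx, tmn, tb, topt, tmax))
out
```

Because no per-row function call is needed, this form usually scales better to long data frames than rowwise.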
What I want to do is store the output of veh.velocity from each iteration and combine it into a new data frame y. I understand it is best to set up an empty data frame first and then combine the columns of data at the end. The iterations also produce different numbers of rows. Is it possible to consider just the first 20 rows? Apologies if there are several issues and misconceptions below; I only started programming a few months ago. Thanks
CSV File: https://drive.google.com/file/d/1tMOz_yM-WenSOlF3UK6UatniwtFI7kzf/view?usp=sharing
#Time series
#This programme evaluates each vehicles speed behaviour w.r.t time.
library(ggplot2)
library(fpc)
library(factoextra)
library(readr)
library(plotly)
library(dplyr)
library(fpp2)
#Clear all variables in workspace
rm(list=ls())
#Importing data
df <- read_csv("01_tracks.csv")
#Preparing data
df1 <- filter(df,laneId == 5, width <= 6) #Filtering to only lane 5 and no trucks
#Creating empty lists
y <- data.frame()
#Loop to plot time series for only filtered vehicle id's
for(i in unique(df1$id)[1:6]) { #Only considering first 6 vehicles for now due to long computation time
print(i) #List of vehicle id's
veh <- filter(df1,id == i) #New dataframe for vehicles/id's which are in lanes 5
timeseries <- ts(veh[,7],start = 1) #Declare as time series data
plot(autoplot(timeseries) + ggtitle(i) + ylab("X Velocity")) #Plotting time series
veh.velocity <- select(veh,xVelocity) #New dataframe for only vehicle id and its velocity
y <- cbind.data.frame(y,veh.velocity)
}
If you just want to plot a time series of the individual vehicle IDs, I'm not sure there's a reason to use a loop. For example, you could easily make a time variable and plot with ggplot2:
library(dplyr)
library(ggplot2)
df1 %>%
group_by(id) %>%
mutate(time = 1:n()) %>%
ggplot(aes(x = time, y = xVelocity, color = as.factor(id))) +
geom_line(show.legend = FALSE, alpha = 0.5)
And you could use pivot_wider from tidyr to reshape the data:
library(tidyr)
result <- df1 %>%
group_by(id) %>%
mutate(time = 1:n()) %>%
dplyr::select(time, xVelocity) %>%
pivot_wider(id_cols = time, values_from = xVelocity,
names_prefix = "Veh.", names_from = id)
result
# A tibble: 414 x 287
time Veh.1 Veh.3 Veh.7 Veh.11 Veh.12 Veh.14 Veh.15 Veh.25 Veh.31 Veh.47 Veh.50 Veh.53 Veh.55 Veh.59
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 40.8 35.7 32.6 24.7 35.8 41.5 35.7 37.9 39.2 39.2 40.3 39.5 33.0 38.2
2 2 40.9 35.7 32.6 24.8 35.8 41.5 35.7 37.9 39.2 39.2 40.3 39.5 33.0 38.2
3 3 40.9 35.7 32.6 24.8 35.8 41.5 35.7 38.0 39.2 39.2 40.3 39.6 33.0 38.2
4 4 40.9 35.7 32.6 24.8 35.8 41.5 35.7 38.0 39.2 39.2 40.3 39.6 33.1 38.2
5 5 40.9 35.7 32.6 24.8 35.8 41.5 35.7 38.0 39.2 39.2 40.3 39.6 33.1 38.1
6 6 40.9 35.7 32.6 24.9 35.8 41.6 35.7 38 39.2 39.2 40.4 39.6 33.1 38.1
7 7 40.9 35.7 32.7 24.9 35.8 41.6 35.7 38.0 39.2 39.2 40.4 39.6 33.1 38.1
8 8 40.9 35.7 32.7 24.9 35.8 41.6 35.7 38.0 39.2 39.3 40.4 39.6 33.1 38.1
9 9 41.0 35.8 32.7 25.0 35.8 41.6 35.7 38.0 39.2 39.3 40.4 39.6 33.1 38.1
10 10 41.0 35.8 32.7 25 35.8 41.6 35.7 38.1 39.2 39.3 40.4 39.7 33.1 38.1
# … with 404 more rows, and 272 more variables:
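On the question of keeping only the first 20 rows per vehicle, one option is to filter on the within-group position before reshaping. This is a sketch on made-up data standing in for df1 (the ids, row counts and velocities are invented); only the id and xVelocity column names are taken from the question.

```r
library(dplyr)
library(tidyr)

# Made-up stand-in for df1: three vehicles with different numbers of rows
toy <- data.frame(
  id = rep(c(1, 3, 7), times = c(25, 30, 22)),
  xVelocity = seq_len(77) / 10
)

first20 <- toy %>%
  group_by(id) %>%
  mutate(time = row_number()) %>%  # position of each row within its vehicle
  filter(time <= 20) %>%           # keep only the first 20 rows per vehicle
  ungroup() %>%
  pivot_wider(id_cols = time, names_from = id,
              values_from = xVelocity, names_prefix = "Veh.")

dim(first20)
```

Since every vehicle in the toy data has at least 20 rows, the wide result has 20 rows and no missing values; vehicles with fewer rows would be padded with NA.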
I am trying to calculate SPEI values using the SPEI package and the Hargreaves method. I want to automate the process so that I can calculate SPEI for all 6 stations in one go and save the results to a new file, spei.3.
SPEI is calculated in three steps. First, we calculate PET values (spei_pet), which are then subtracted from the precipitation values to get the climatic water balance (spei_cwbal). The CWBAL values are then passed, together with a scale, to the spei function from the package of the same name to calculate SPEI values.
I am new to R and very new to the tidyverse, which is said to be easier to work with. I wrote the code below for this task, but I am surely missing something (or maybe many things), because the code throws an error. Please help me identify the error in my code and find a solution.
library(tidyverse)
library(SPEI)
file_path = "I:/Proj/Excel sheets - climate/SPI/heatmap/spei_forecast_data.xlsx"
file_forecast = openxlsx::read.xlsx(file_path)
##spei calculation
spei.scale = c(3, 6, 9, 12, 15, 24)
stations = c(1:3, 5:7)
lat = c(23.29, 23.08, 22.95, 22.62, 22.43, 22.40)
lat.fn = function(i) {
if (i <= 3)
lat.fn = lat[i]
else if (i == 5)
lat.fn = lat[4]
else if (i == 6)
lat.fn = lat[5]
else if (i == 7)
lat.fn = lat[6]
}
for ( i in stations) {
file_forecast %>%
mutate(spei_pet[i] <- hargreaves(Tmin = file_forecast$paste("tmin", i),
Tmax = file_forecast$paste("tmax", i),
Pre = file_forecast$paste("p", i),
lat = lat.fn[i])) %>%
mutate(spei_cwbal[i] <- spei_pet[[i]] - file_forecast$paste("p", i)) %>%
mutate(spei.3[i] <- spei(spei_cwbal[[i]], scale = 3))
}
It throws an error:
Error in as.matrix(Tmin) : attempt to apply non-function
lat.fn[i] also throws an error, which gets rectified if I use no i. But I need to use some kind of function so that lat.fn takes different value depending on i.
Error in lat.fn[i] : object of type 'closure' is not subsettable
Thanks.
Edit: The data is in the form of a data.frame. I converted it into a tibble to give an idea of what it looks like.
> file_forecast
# A tibble: 960 x 20
Month p7 p6 p5 p3 p2 p1 tmax7 tmax6 tmax5 tmax3 tmax2 tmax1 tmin7 tmin6 tmin5 tmin3 tmin2 tmin1
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Jan 0.162 0.185 0.293 0.436 0.529 0.658 26.4 26.5 26.2 25.9 25.7 24.9 9.57 9.75 10.0 10.4 9.94 9.77
2 Feb 0.207 0.305 0.250 0.260 0.240 0.186 32.2 32.2 32.1 31.9 31.8 30.9 12.4 12.7 12.7 13.0 12.2 11.9
3 Mar 0.511 0.650 0.602 0.636 0.625 0.501 37.3 37.1 37.1 37.0 36.9 36.1 18.7 19.3 18.3 18.0 17.3 16.9
4 Apr 0.976 1.12 1.05 1.12 1.17 1.16 39.5 39.2 39.6 39.5 39.5 38.8 22.8 23.2 22.5 22.2 21.7 20.8
5 May 3.86 4.12 3.76 4.29 4.15 3.84 38.2 37.9 38.3 38.1 38.2 37.6 25.1 25.4 24.9 24.7 24.5 23.8
6 Jun 7.31 8.27 7.20 8.51 9.14 8.76 38.0 37.6 38.1 38.0 38.0 37.7 27.2 27.3 26.9 26.7 26.6 26.1
7 Jul 13.9 15.6 13.2 17.0 19.1 17.8 33.9 33.6 34.0 33.9 33.8 33.5 26.8 26.9 26.6 26.5 26.4 26.0
8 Aug 15.2 17.2 14.4 18.6 20.1 18.4 32.6 32.4 32.7 32.4 32.3 32.0 26.2 26.4 26.1 25.9 25.9 25.4
9 Sep 11.4 11.9 10.5 12.9 13.2 13.1 31.9 31.9 31.8 31.5 31.5 30.9 24.4 24.6 24.3 24.3 24.3 23.7
10 Oct 5.19 5.76 4.81 5.40 5.44 5.04 29.8 30.0 29.6 29.3 29.3 28.6 20.9 21.1 20.8 20.9 20.8 20.2
# ... with 950 more rows, and 1 more variable: year <dbl>
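Both error messages point at syntax rather than at the SPEI package. file_forecast$paste("tmin", i) asks for a column literally named paste and then tries to call it, hence "attempt to apply non-function"; a column whose name is built at run time is selected with [[ ]] instead. Note also that paste inserts a space ("tmin 1"), so paste0 is needed to match names such as tmin1. lat.fn is a function, so it is called with parentheses, lat.fn(i), not indexed with lat.fn[i]; a named lookup vector avoids the function altogether. The sketch below shows only these two fixes on a made-up miniature of the data; the temperature-range column is a placeholder for the PET step and is not the Hargreaves formula.

```r
# Made-up miniature of file_forecast with the same column naming scheme
file_forecast <- data.frame(
  p1 = c(0.6, 0.2),  tmax1 = c(24.9, 30.9), tmin1 = c(9.8, 11.9),
  p5 = c(0.3, 0.25), tmax5 = c(26.2, 32.1), tmin5 = c(10.0, 12.7)
)

stations <- c(1, 5)
# Latitude keyed by station number, replacing the if/else chain in lat.fn
lat <- c("1" = 23.29, "5" = 22.62)

for (i in stations) {
  # [[ ]] with a constructed name selects the column
  tmin_i <- file_forecast[[paste0("tmin", i)]]
  tmax_i <- file_forecast[[paste0("tmax", i)]]
  lat_i  <- lat[[as.character(i)]]
  cat("station", i, "latitude", lat_i, "\n")
  # Placeholder for the PET step: store a per-station result as a new column
  file_forecast[[paste0("trange", i)]] <- tmax_i - tmin_i
}

names(file_forecast)
```

The same pattern (build the name with paste0, read and write with [[ ]]) would carry over to the real PET, water balance and SPEI steps.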
I have a dataframe containing multiple entries per week. It looks like this:
Week t_10 t_15 t_18 t_20 t_25 t_30
1 51.4 37.8 25.6 19.7 11.9 5.6
2 51.9 37.8 25.8 20.4 12.3 6.2
2 52.4 38.5 26.2 20.5 12.3 6.1
3 52.2 38.6 26.1 20.4 12.4 5.9
4 52.2 38.3 26.1 20.2 12.1 5.9
4 52.7 38.4 25.8 20.0 12.1 5.9
4 51.1 37.8 25.7 20.0 12.2 6.0
4 51.9 38.0 26.0 19.8 12.0 5.8
The Weeks have different amounts of entries, they range from one entry for a week to multiple (up to 4) entries a week.
I want to calculate the median for each week, for all the different variables (t_10 through t_30), and output the result in a new dataframe. NA cells are already omitted in the original dataframe. I have tried different approaches with the ddply function of the plyr package, but to no avail so far.
We could use summarise_at for multiple columns
library(dplyr)
colsToKeep <- c("t_10", "t_30")
df1 %>%
group_by(Week) %>%
summarise_at(vars(colsToKeep), median)
# A tibble: 4 x 3
# Week t_10 t_30
# <int> <dbl> <dbl>
#1 1 51.40 5.60
#2 2 52.15 6.15
#3 3 52.20 5.90
#4 4 52.05 5.90
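In dplyr 1.0.0 and later, summarise_at is superseded by across, and all_of is the recommended way to select columns whose names are stored in a character vector. A sketch of the same computation in that style, using the data from the question (the object name res is arbitrary):

```r
library(dplyr)

# Data copied from the question
df1 <- read.table(header = TRUE, text = "
Week t_10 t_15 t_18 t_20 t_25 t_30
1 51.4 37.8 25.6 19.7 11.9 5.6
2 51.9 37.8 25.8 20.4 12.3 6.2
2 52.4 38.5 26.2 20.5 12.3 6.1
3 52.2 38.6 26.1 20.4 12.4 5.9
4 52.2 38.3 26.1 20.2 12.1 5.9
4 52.7 38.4 25.8 20.0 12.1 5.9
4 51.1 37.8 25.7 20.0 12.2 6.0
4 51.9 38.0 26.0 19.8 12.0 5.8
")

colsToKeep <- c("t_10", "t_30")

res <- df1 %>%
  group_by(Week) %>%
  summarise(across(all_of(colsToKeep), median))  # median of each kept column per week
res
```

Replacing all_of(colsToKeep) with starts_with("t_") would give the medians of all six variables, as asked in the question.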
Specify variables to keep in colsToKeep and store input table in d
library(tidyverse)
colsToKeep <- c("t_10", "t_30")
gather(d, variable, value, -Week) %>%
filter(variable %in% colsToKeep) %>%
group_by(Week, variable) %>%
summarise(median = median(value))
# A tibble: 8 x 3
# Groups: Week [4]
Week variable median
<int> <chr> <dbl>
1 1 t_10 51.40
2 1 t_30 5.60
3 2 t_10 52.15
4 2 t_30 6.15
5 3 t_10 52.20
6 3 t_30 5.90
7 4 t_10 52.05
8 4 t_30 5.90
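You can also use the aggregate function, with the measurement columns on the left of the formula and the grouping variable on the right:
newdf <- aggregate(. ~ Week, data = df, FUN = median)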