How to reference "cells" within a column in R?

I'm trying to calculate numeric ranges based on the moving average of a column of data. I have found a way to use caTools::runmean to produce a column of moving averages, and I know how to work with this in Excel to produce the columns I want, but I would love to know a way to do all of this in one R script.
Here is my simplified reproducible example for R.
library(tidyverse)
library(caTools)
data <- as_tibble(data.frame(
  Index = as.integer(c(18, 19, 21, 22, 23, 25, 26, 29)),
  mydbl = c(8.905, 13.31, 15.739, 17.544, 19.054, 20.393, 21.623, 22.764)))

data <- data %>%
  mutate(avg = runmean(mydbl,
                       k = 2,
                       alg = "exact",
                       endrule = "NA"))
This tibble will look like this:
> data
# A tibble: 8 x 3
Index mydbl avg
<int> <dbl> <dbl>
1 18 8.90 NA
2 19 13.3 11.1
3 21 15.7 14.5
4 22 17.5 16.6
5 23 19.1 18.3
6 25 20.4 19.7
7 26 21.6 21.0
8 29 22.8 22.2
To produce the remaining data I want, I exported this to Excel with write_csv(data,...) and the final table is shown below. The first value in dbl_i is the formula =B2-ABS(C3-B2) (the difference between mydbl and the next avg subtracted from mydbl to create an equidistant lower limit). The last value in dbl_f is the formula =B9+ABS(C9-B9) (the difference between mydbl and the avg added to mydbl to create an equidistant upper limit). The other values in the two columns are just direct references to the avg column.
Index mydbl avg dbl_i dbl_f
18 8.905 NA 6.7025 11.1075
19 13.31 11.1075 11.1075 14.5245
21 15.739 14.5245 14.5245 16.6415
22 17.544 16.6415 16.6415 18.299
23 19.054 18.299 18.299 19.7235
25 20.393 19.7235 19.7235 21.008
26 21.623 21.008 21.008 22.1935
29 22.764 22.1935 22.1935 23.3345
Yes, dbl_i is just the avg column but with the first value being =B2-ABS(C3-B2). And the dbl_f column is the same as the avg column except shifted up one row, with the final value being =B9+ABS(C9-B9). Ultimately the real problem lies in finding a way to reproduce the Excel calculations D2=B2-ABS(C3-B2) and E9=B9+ABS(C9-B9) in R.
Does anyone know how they would reproduce these calculations in R? I was looking for a way to create a formula in R that could be the equivalent of B2-ABS(C3-B2), but could not find one, unless I create a matrix instead. Do I have to create a matrix?
Thanks for your time.

data %>%
  mutate(
    avg   = zoo::rollmean(mydbl, 2, align = "right", fill = NA),
    dbl_i = if_else(row_number() == 1L, mydbl - abs(lead(avg) - mydbl), avg),
    dbl_f = if_else(row_number() == n(), mydbl + abs(avg - mydbl), lead(avg))
  )
# # A tibble: 8 x 5
# Index mydbl avg dbl_i dbl_f
# <int> <dbl> <dbl> <dbl> <dbl>
# 1 18 8.90 NA 6.70 11.1
# 2 19 13.3 11.1 11.1 14.5
# 3 21 15.7 14.5 14.5 16.6
# 4 22 17.5 16.6 16.6 18.3
# 5 23 19.1 18.3 18.3 19.7
# 6 25 20.4 19.7 19.7 21.0
# 7 26 21.6 21.0 21.0 22.2
# 8 29 22.8 22.2 22.2 23.3
Honestly it's not the most elegant, but it gets the job done.
(BTW: I'm using zoo::rollmean because I don't have caTools installed, but it's the same effect I believe.)
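For completeness, a sketch that keeps the original caTools::runmean() call from the question and adds the two boundary calculations with lead()/if_else(); this assumes runmean() with those arguments produces the same avg column shown above:
library(tidyverse)
library(caTools)

data %>%
  mutate(
    avg   = runmean(mydbl, k = 2, alg = "exact", endrule = "NA"),
    # first row: lower limit equidistant from mydbl, mirroring the next avg
    dbl_i = if_else(row_number() == 1L, mydbl - abs(lead(avg) - mydbl), avg),
    # last row: upper limit equidistant from mydbl, mirroring its own avg
    dbl_f = if_else(row_number() == n(), mydbl + abs(avg - mydbl), lead(avg))
  )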

R: "the condition has length > 1" error in if statement [duplicate]

This is my first time asking a question on Stack Overflow and also my first time coding in R,
so please bear with me if my explanation is unclear :(
I have a data frame (data2000) that is 1092 x 6.
The headers are year, month, predictive horizon, name of the company, GDP Price Index, and Consumer Price Index.
I want to create vectors of gdppi and cpi for each month.
My ultimate goal is to get the mean, median, interquartile range, and 90th-10th percentile range for each month, and I thought this was the first step.
This is the code I have written so far:
library(tidyverse)
data2000 <- read.csv("")

for (i in 1:12) {
  i_gdppi <- c()
  i_cpi <- c()
}

for (i in 1:12) {
  if (data2000$month == i) {
    append(i_gdppi, data2000[, gdppi])
    append(i_cpi, data2000[, cpi])
  }
}
Unfortunately, I got an error message saying:
Error in if (data2000$month == 1) { : the condition has length > 1
I googled it and learned that I cannot use a vector as the condition in an if statement.
How can I solve this problem?
Thank you so much and have a nice day!
If you use the group_by() function then it takes care of sub-setting your data:
library(dplyr)

data2000 <- data.frame(month = rep(1:12, times = 2), gdppi = runif(24) * 100) # Dummy data

data2000 |>
  group_by(month) |>
  summarise(mean = mean(gdppi),
            q10 = quantile(gdppi, probs = .10),
            q25 = quantile(gdppi, probs = .25)) # Add the other percentiles, as needed
Gives this
# A tibble: 12 x 4
month mean q10 q25
<int> <dbl> <dbl> <dbl>
1 1 12.5 3.44 6.83
2 2 34.7 7.15 17.5
3 3 37.8 22.1 28.0
4 4 30.3 19.0 23.2
5 5 65.7 62.2 63.5
6 6 60.7 38.7 47.0
7 7 43.0 38.2 40.0
8 8 77.9 60.7 67.1
9 9 56.3 44.0 48.6
10 10 53.1 19.6 32.2
11 11 63.8 40.6 49.3
12 12 59.0 49.2 52.9
If you have years and months, then group_by(year, month)
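If you also want the median, interquartile range and 90th-10th percentile range mentioned in the question, a sketch along the same lines (using the same dummy gdppi column; the real cpi column could be summarised the same way):
data2000 |>
  group_by(month) |>
  summarise(mean   = mean(gdppi),
            median = median(gdppi),
            iqr    = IQR(gdppi),                                    # 75th - 25th percentile
            p90_10 = quantile(gdppi, 0.9) - quantile(gdppi, 0.1))   # 90th - 10th percentile range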

Sum total distance by groups

I have a df tracking movement of points each hour. I want to find the total distance traveled by that group/trial by adding the distance between the hourly coordinates, but I'm confusing myself with apply functions.
I want to say "in each group/trial, sum [distance(hour1-hou2), distance(hour2=hour3), distance(hour3-hour4)....] until current hour so on each line, I have a cumulative distance travelled value.
I've created a fake df below.
library(ggplot2)

paths <- data.frame(matrix(nrow = 80, ncol = 5))
colnames(paths) <- c("trt", "trial", "hour", "X", "Y")
paths$trt <- rep(c("A", "B", "C", "D"), each = 20)
paths$trial <- rep(c(rep(1, times = 10), rep(2, times = 10)), times = 4)
paths$hour <- rep(1:10, times = 8)
paths[, 4:5] <- runif(160, 0, 50)

# this shows the paths that I want to measure
ggplot(data = paths, aes(x = X, y = Y, group = interaction(trt, trial), color = trt)) +
  geom_path()
I probably want to add a column paths$dist.traveled to keep track each hour.
I think I could use apply or maybe even aggregate but I've been using PointDistance to find the distances, so I'm a bit confused. I also would rather not do a loop inside a loop, because the real dataset is large.
Here's an answer that uses {dplyr}:
library(dplyr)
paths %>%
  arrange(trt, trial, hour) %>%
  group_by(trt, trial) %>%
  mutate(dist_travelled = sqrt((X - lag(X))^2 + (Y - lag(Y))^2)) %>%
  mutate(total_dist = sum(dist_travelled, na.rm = TRUE)) %>%
  ungroup()
If you wanted the total distance but grouped only by trt and not trial you would just remove that from the call to group_by().
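If you want a running (cumulative) distance on each line up to the current hour, as the question describes, rather than the group total, a sketch that replaces the sum with cumsum() (same column names as above):
paths %>%
  arrange(trt, trial, hour) %>%
  group_by(trt, trial) %>%
  mutate(dist_travelled = sqrt((X - lag(X))^2 + (Y - lag(Y))^2),
         # treat the first hour's NA as 0 so the running total starts at 0
         cum_dist = cumsum(coalesce(dist_travelled, 0))) %>%
  ungroup()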
Is this what you are trying to achieve?
paths <- paths %>%
  mutate(dist.traveled = sqrt((X - lag(X))^2 + (Y - lag(Y))^2))
paths
trt trial hour X Y dist.traveled
<chr> <dbl> <int> <dbl> <dbl> <dbl>
1 A 1 1 11.2 26.9 NA
2 A 1 2 20.1 1.48 27.0
3 A 1 3 30.4 0.601 10.4
4 A 1 4 31.1 26.6 26.0
5 A 1 5 38.1 30.4 7.88
6 A 1 6 27.9 47.9 20.2
7 A 1 7 16.5 35.3 16.9
8 A 1 8 0.328 13.0 27.6
9 A 1 9 14.0 41.7 31.8
10 A 1 10 29.7 7.27 37.8
# ... with 70 more rows
paths$dist.traveled[which(paths$hour == 1)] <- NA
paths %>%
  group_by(trt) %>%
  summarise(total_distance = sum(dist.traveled, na.rm = TRUE))
trt total_distance
<chr> <dbl>
1 A 492.
2 B 508.
3 C 479.
4 D 462.
I add a new column with the distance travelled each hour, set the first hour of each path to NA, and then sum the distances for each group.

How to calculate thermal indices in R using WorldClim data

Is there a way of plotting global Warmth Index using WorldClim data in R?
For those not familiar with Warmth Index, it's an equation written by Yim & Kira to describe length and intensity of a growing period, see here: https://www.jstage.jst.go.jp/article/seitai/25/2/25_KJ00001775740/_pdf/-char/en
My example: I have a data set of locations for plant populations where I used WorldClim data to derive monthly mean temperature at each location, and have them described in a tibble:
## # A tibble: 5 x 18
## species latitude longitude temp_1 temp_2 temp_3 temp_4 temp_5 temp_6
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Magnol… 31.0 -91.5 9.05 11.1 15.5 19.6 23.1 26.3
## 2 Magnol… 35.7 -93.2 2.45 4.89 10.2 15.5 19.5 23.8
## 3 Magnol… 35.7 -93.2 2.45 4.89 10.2 15.5 19.5 23.8
## 4 Magnol… 43.2 -76.3 -5.33 -4.55 0.98 7.42 13.7 18.5
## 5 Magnol… 35.6 -92.9 2.45 4.89 10.2 15.5 19.5 23.8
## # … with 9 more variables: temp_7 <dbl>, temp_8 <dbl>, temp_9 <dbl>,
## # temp_10 <dbl>, temp_11 <dbl>, temp_12 <dbl>, valid_cells <dbl>,
## # warmth_index <dbl>, row_id <int>
The data is then reshaped from wide to long:
reshaped_data <- raw_data %>%
  tidyr::gather(key = "month", value = "temp", temp_1:temp_12) %>%
  mutate(month = stringr::str_remove(month, "temp_") %>% readr::parse_number(),
         warm = case_when(temp > 5 ~ TRUE,
                          TRUE ~ FALSE))
Using Yim & Kira's equation, a colleague wrote the following to calculate the Warmth Index at each location:
warmth_index <- function(warm, temp){
  warm_months <- sum(warm)
  temp_sum <- sum(warm * temp) # when multiplied, the warm logical vector becomes 0 & 1
  temp_sum - (5 * warm_months)
}
This equation allows me to calculate the Warmth Index using mean temperatures at specific locations, and it does this once I've reshaped all the data.
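For reference, a minimal sketch of how the function is applied to the reshaped data, assuming row_id uniquely identifies each location:
reshaped_data %>%
  group_by(row_id) %>%
  summarise(wi = warmth_index(warm, temp))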
But my issue is this: I'd like to find all the places in the world where a similar Warmth Index is found. My guess is that I should use RasterStacks (e.g. https://www.benjaminbell.co.uk/2018/02/rasterstacks-and-rasterplot.html) to bundle all the WorldClim tiff files together and use the calc() function, as you would to calculate global max or min temperature, e.g.:
ma.t.MIN <- calc(ma.t.min, min)
ma.t.MAX <- calc(ma.t.max, max)
But I'm not sure how to apply the Warmth Index equation to a RasterStack as it relies upon a reshaped tibble, rather than tiffs in my project folder... any ideas how to do it? Ultimately I'd like to end up with a plot, showing the world graded by Warmth Index.
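Not a definitive answer, but a sketch of the general approach: calc() applies a function to the vector of layer values in each cell, so the same warm-month logic can be written as a per-cell function instead of relying on the reshaped tibble. The file path, pattern and variable names below are hypothetical; point list.files() at wherever the 12 monthly WorldClim mean-temperature tiffs live.
library(raster)

# Hypothetical folder/pattern for the 12 monthly mean-temperature tiffs
temp_files <- list.files("worldclim_tavg", pattern = "\\.tif$", full.names = TRUE)
temp_stack <- stack(temp_files) # one layer per month
# Note: check whether your WorldClim version stores temperature in degrees C or degrees C * 10

# Warmth Index per cell: sum of (temp - 5) over months with temp > 5
wi_fun <- function(x) {
  if (all(is.na(x))) return(NA_real_)
  warm <- !is.na(x) & x > 5
  sum(x[warm] - 5)
}

warmth_raster <- calc(temp_stack, fun = wi_fun)
plot(warmth_raster) # world map graded by Warmth Index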

How to fit a function for different groups in a data set using R

Please, how can I fit a function for different groups in a data set (Soil) using R? The first column is the group, i.e. Plot, and the second column is the observed variable, i.e. Depth.
Plot Depth
1 12.5
1 14.5
1 15.8
1 16.1
1 18.9
1 21.2
1 23.4
1 25.7
2 13.1
2 15.0
2 15.8
2 16.3
2 17.4
2 18.6
2 22.6
2 24.1
2 25.6
3 11.5
3 12.2
3 13.9
3 14.7
3 18.9
3 20.5
3 21.6
3 22.6
3 24.1
3 25.8
4 10.2
4 21.5
4 15.1
4 12.3
4 10.0
4 13.5
4 16.5
4 19.2
4 17.6
4 14.1
4 19.7
I used a 'for' loop but only saw output for Plot 1.
This is how I applied the 'for' loop:
After importing my data into R, I saved it as SNq:
for (i in 1:SNq$Plot[i]) {
  dp <- SNq$Depth[SNq$Plot == SNq$Plot[i]]
  fit1 = fitdist(dp, "gamma") ## this is the function I'm fitting. The function is not the issue. My challenge is the 'for' statement.
  fit1
}
I think this should work. Just make one change in your code:
Why would it work?
Because unique() returns the unique values (1, 2, 3, 4), which are exactly the groups in the Plot column. For each of these values we can subset the data with SNq$Depth[SNq$Plot == i] and get the depth values for that group.
for (i in unique(SNq$Plot)) { # <- here
  dp <- SNq$Depth[SNq$Plot == i]
  fit1 = fitdist(dp, "gamma") ## this is the function I'm fitting. The function is not the issue. My challenge is the 'for' statement.
  plot(fit1)
}
A tidyverse suggestion:
library("tidyverse")
library("fitdistrplus")
fits <- SNq %>%
group_by(Plot) %>%
nest() %>%
mutate(fits = map(data, ~ fitdist(data = .$Depth, distr = "gamma")),
summaries = map(fit, summary))
You could continue with print(fits$fits) and print(fits$summaries) to access the different fits and their summary. Alternatively you can use a syntax like fits$fits[[1]] and fits$summaries[[1]] to access them.
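If you want the fitted shape and rate for each Plot in a single table, a small sketch continuing from the fits tibble above (assumes every gamma fit succeeded):
fits %>%
  mutate(params = map(fits, ~ as.list(.x$estimate))) %>% # named list: shape, rate
  tidyr::unnest_wider(params) %>%
  select(Plot, shape, rate)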
Try:
for (i in 1:nrow(SNq)) {
  dp <- SNq$Depth[SNq$Plot == SNq$Plot[i]]
  fit1 = fitdist(dp, "gamma")
  print(fit1) # explicit print() is needed for output inside a loop
}

R generate bins from a data frame respecting blanks

I need to generate bins from a data.frame based on the values of one column. I have tried the function "cut".
For example: I want to create bins of air temperature values in the column "AirTDay" in a data frame:
AirTDay (oC)
8.16
10.88
5.28
19.82
23.62
13.14
28.84
32.21
17.44
31.21
I need the bin intervals to include all values in a range of 2 degrees centigrade from that initial value (i.e. 8-9.99, 10-11.99, 12-13.99...), to be labelled with the average value of the range (i.e. 9.5, 10.5, 12.5...), and to respect blank cells, returning "NA" in the bins column.
The output should look as:
Air_T (oC)   TBins
8.16         8.5
10.88        10.5
5.28         NA
             NA
19.82        20.5
23.62        24.5
13.14        14.5
             NA
             NA
28.84        28.5
32.21        32.5
17.44        18.5
31.21        32.5
I've gotten as far as:
setwd('C:/Users/xxx')
temp_data <- read.csv("temperature.csv", sep = ",", header = TRUE)
TAir <- temp_data$AirTDay
Tmin <- round(min(TAir, na.rm = TRUE), digits = 0) # start at the minimum value
Tmax <- round(max(TAir, na.rm = TRUE), digits = 0)
int <- 2 # bin ranges of 2 degrees
mean_int <- int / 2
int_range <- seq(Tmin, Tmax + int, int) # generate bin sequence
bin_label <- seq(Tmin + mean_int, Tmax + mean_int, int) # generate labels
temp_data$TBins <- cut(TAir, breaks = int_range, ordered_result = FALSE, labels = bin_label)
The output table looks correct, but for some reason it shows an additional sequential column, shifts the column names, and collapses all values, eliminating the blank cells. Something like this:
Air_T (oC) TBins
1 8.16 8.5
2 10.88 10.5
3 5.28 NA
4 19.82 20.5
5 23.62 24.5
6 13.14 14.5
7 28.84 28.5
8 32.21 32.5
9 17.44 18.5
10 31.21 32.5
Any ideas on where am I failing and how to solve it?
v <- ceiling(max(dat$V1, na.rm = TRUE))
breaks <- seq(8, v, 2)
labels <- seq(8.5, length.out = length(breaks) - 1, by = 2)
transform(dat, Tbins = cut(V1, breaks, labels))
V1 Tbins
1 8.16 8.5
2 10.88 10.5
3 5.28 <NA>
4 NA <NA>
5 19.82 18.5
6 23.62 22.5
7 13.14 12.5
8 NA <NA>
9 NA <NA>
10 28.84 28.5
11 32.21 <NA>
12 17.44 16.5
13 31.21 30.5
This result follows the logic given: we have
paste(seq(8, v - 2, 2), seq(9.99, v, by = 2), sep = "-")
[1] "8-9.99" "10-11.99" "12-13.99" "14-15.99" "16-17.99" "18-19.99" "20-21.99"
[8] "22-23.99" "24-25.99" "26-27.99" "28-29.99" "30-31.99"
From this we can tell that 19.82 lies between 18 and 19.99 and is thus given the value 18.5, just as 10.88 lies between 10 and 11.99 and is thus assigned the value 10.5.
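If the labels should instead be the true midpoint of each 2-degree bin (9 for 8-9.99, 11 for 10-11.99, and so on), a small variation of the same idea (a sketch; adjust the break range so it covers your data):
breaks <- seq(8, 34, by = 2)           # extend the top break to cover 32.21
mids   <- head(breaks, -1) + 1         # 9, 11, 13, ... bin midpoints
dat$Tbins <- cut(dat$V1, breaks = breaks, labels = mids, right = FALSE)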
