How to calculate thermal indices in R using WorldClim data - r

Is there a way of plotting global Warmth Index using WorldClim data in R?
For those not familiar with Warmth Index, it's an equation written by Yim & Kira to describe length and intensity of a growing period, see here: https://www.jstage.jst.go.jp/article/seitai/25/2/25_KJ00001775740/_pdf/-char/en
My example: I have a data set of locations for plant populations where I used WorldClim data to derive monthly mean temperature at each location, and have them described in a tibble:
## # A tibble: 5 x 18
## species latitude longitude temp_1 temp_2 temp_3 temp_4 temp_5 temp_6
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Magnol… 31.0 -91.5 9.05 11.1 15.5 19.6 23.1 26.3
## 2 Magnol… 35.7 -93.2 2.45 4.89 10.2 15.5 19.5 23.8
## 3 Magnol… 35.7 -93.2 2.45 4.89 10.2 15.5 19.5 23.8
## 4 Magnol… 43.2 -76.3 -5.33 -4.55 0.98 7.42 13.7 18.5
## 5 Magnol… 35.6 -92.9 2.45 4.89 10.2 15.5 19.5 23.8
## # … with 9 more variables: temp_7 <dbl>, temp_8 <dbl>, temp_9 <dbl>,
## # temp_10 <dbl>, temp_11 <dbl>, temp_12 <dbl>, valid_cells <dbl>,
## # warmth_index <dbl>, row_id <int>
The data is then reshaped from wide to long:
reshaped_data <- raw_data %>%
tidyr::gather(key = "month", value = "temp", temp_1:temp_12) %>%
mutate(month = stringr::str_remove(month, "temp_") %>% readr::parse_number(),
warm = case_when(temp > 5 ~ TRUE,
TRUE ~ FALSE))
Using Yim & Kira's equation, a colleague wrote the following to calculate the Warmth Index at each location:
warmth_index <- function(warm, temp){
warm_months <- sum(warm)
temp_sum <- sum(warm * temp) # when multiplied, the warm logical vector becomes 0 & 1
temp_sum - (5 * warm_months)
}
This equation allows me to calculate the Warmth Index using mean temperatures at specific locations, and it does this once I've reshaped all the data.
But my issue is this: I'd like to find all the places in the world where a similar Warmth Index is found. My guess is that I should use RasterStacks (e.g. https://www.benjaminbell.co.uk/2018/02/rasterstacks-and-rasterplot.html) to bundle all the WorldClim tiff files together and then use the calc() function, as you would to calculate the global minimum or maximum temperature, for example:
ma.t.MIN <- calc(ma.t.min, min)
ma.t.MAX <- calc(ma.t.max, max)
But I'm not sure how to apply the Warmth Index equation to a RasterStack as it relies upon a reshaped tibble, rather than tiffs in my project folder... any ideas how to do it? Ultimately I'd like to end up with a plot, showing the world graded by Warmth Index.
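One possible approach, sketched under assumptions: calc() from the raster package hands the function a vector of layer values for each cell, so the equation can be rewritten to take the 12 monthly means directly, with no reshaping. The function name warmth_index_cell and the folder name wc_tavg are made up for illustration, and the raster part only runs if the package is installed and that folder holds exactly 12 GeoTIFFs. Check the units of your files first: WorldClim 1.4 stores temperature as degrees C multiplied by 10, whereas version 2 uses degrees C.

```r
# Warmth Index for one location.
# `temps`: numeric vector of 12 monthly mean temperatures in degrees C.
# `threshold`: the 5 degree cutoff used in warmth_index() above.
warmth_index_cell <- function(temps, threshold = 5) {
  if (anyNA(temps)) return(NA_real_)   # e.g. ocean cells with no data
  warm <- temps > threshold
  sum(temps[warm] - threshold)         # same as sum(warm * temp) - 5 * sum(warm)
}

# Quick check on one made-up location (values are illustrative only)
example_temps <- c(-5, -4, 1, 7, 14, 18, 21, 20, 16, 10, 4, -2)
wi_example <- warmth_index_cell(example_temps)

# Apply it to every cell of a 12-layer stack.
# "wc_tavg" is a hypothetical folder holding the 12 monthly GeoTIFFs.
tif_files <- list.files("wc_tavg", pattern = "\\.tif$", full.names = TRUE)
if (requireNamespace("raster", quietly = TRUE) && length(tif_files) == 12) {
  tavg   <- raster::stack(tif_files)
  wi_map <- raster::calc(tavg, fun = warmth_index_cell)
  raster::plot(wi_map, main = "Warmth Index")
}
```

Because the index is a sum over months, the order in which the 12 layers are stacked does not matter here.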

Related

Problems with appending t.test results in a for loop

Let me take simulated datasets to explain:
I have two datasets, dt and dt2:
# dataset 1 `dt`
set.seed(12)
dt <- rnorm(5000,mean=10,sd=1)
dt <- data.frame(dt)
dt$group <- c("case","control")
colnames(dt) <- c("severity","group")
head(dt)
severity group
1 8.519432 case
2 11.577169 control
3 9.043256 case
4 9.079995 control
5 8.002358 case
6 9.727704 control
# dataset 2 `dt2`
set.seed(12)
dt2 <- rnorm(200,mean=12,sd=1)
dt2 <- data.frame(dt2)
dt2$group <- c("case2","control2")
colnames(dt2) <- c("severity","group")
head(dt2)
severity group
1 10.51943 case2
2 13.57717 control2
3 11.04326 case2
4 11.07999 control2
5 10.00236 case2
6 11.72770 control2
I am building a for loop with 1000 iterations that does the following:
1. randomly take 500 rows from dt and save them as dt_sub
2. rbind dt_sub with dt2 and save the result as bd
3. keep only the rows of bd whose group is either case2 or control (I only care about the difference between these two groups)
4. run a t.test on the variable severity between the case2 and control groups
5. output the t.test results to t
6. repeat this 1000 times with a for loop
7. append each iteration's t.test results to a data frame called results
Here is the code I wrote in R:
library(broom)
library(dplyr)
iter <- 1000
t <- data.frame()
for (i in 1:iter) {
dt_sub <- dt[sample(nrow(dt),500),]
bd <- rbind(dt_sub,dt2)
compare <- filter(bd, group %in% c("case2", "control"))
compare %>% group_by(group) %>% do(tidy(t.test(severity ~ group,data = compare))) -> t
t$iter <- i
}
results <- do.call(rbind,t)
My question is: this code works well when iter = 1, but how should I change the line compare %>% group_by(group) %>% do(tidy(t.test(severity ~ group,data = compare))) -> t so that each run's t.test results are not overwritten when iter > 1? I tried t[i] but it failed. Could anyone advise, please?
Thanks.
Create a function which runs the process once.
library(broom)
library(dplyr)
t_test_function <- function() {
dt_sub <- dt[sample(nrow(dt),500),]
bd <- rbind(dt_sub,dt2)
compare <- filter(bd, group %in% c("case2", "control"))
compare %>%
group_by(group) %>%
do(tidy(t.test(severity ~ group,data = compare))) %>%
ungroup
}
t_test_function()
# group estimate estimate1 estimate2 statistic p.value parameter conf.low
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 case2 1.94 11.9 9.99 17.4 9.40e-42 199. 1.72
#2 cont… 1.94 11.9 9.99 17.4 9.40e-42 199. 1.72
# … with 3 more variables: conf.high <dbl>, method <chr>,
# alternative <chr>
Now you can call this iter times using replicate and combine the dataset.
iter <- 5
results <- bind_rows(replicate(iter, t_test_function(), simplify = FALSE), .id = 'iter')
# A tibble: 10 x 12
# iter group estimate estimate1 estimate2 statistic p.value parameter
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 case2 1.88 11.9 10.1 17.3 1.05e-40 189.
# 2 1 cont… 1.88 11.9 10.1 17.3 1.05e-40 189.
# 3 2 case2 1.96 11.9 9.97 17.8 9.88e-43 194.
# 4 2 cont… 1.96 11.9 9.97 17.8 9.88e-43 194.
# 5 3 case2 1.94 11.9 9.99 17.9 3.76e-42 184.
# 6 3 cont… 1.94 11.9 9.99 17.9 3.76e-42 184.
# 7 4 case2 2.03 11.9 9.90 18.6 1.82e-44 189.
# 8 4 cont… 2.03 11.9 9.90 18.6 1.82e-44 189.
# 9 5 case2 1.96 11.9 9.97 18.1 7.05e-43 187.
#10 5 cont… 1.96 11.9 9.97 18.1 7.05e-43 187.
# … with 4 more variables: conf.low <dbl>, conf.high <dbl>, method <chr>,
# alternative <chr>
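If you would rather keep the for loop, the usual fix is to store each iteration's result in its own slot of a pre-allocated list and bind the list once at the end. Below is a minimal base-R sketch of that idea. It assumes the simulated dt and dt2 from the question, uses only 5 iterations to keep it quick, and pulls a few fields out of the t.test object by hand instead of calling broom::tidy. The name res_list is made up. It also drops group_by(), since the test already compares the two groups and grouping only repeats the same row once per group.

```r
set.seed(12)
dt  <- data.frame(severity = rnorm(5000, mean = 10, sd = 1),
                  group = c("case", "control"))
dt2 <- data.frame(severity = rnorm(200, mean = 12, sd = 1),
                  group = c("case2", "control2"))

iter <- 5
res_list <- vector("list", iter)          # one slot per iteration

for (i in seq_len(iter)) {
  dt_sub  <- dt[sample(nrow(dt), 500), ]
  bd      <- rbind(dt_sub, dt2)
  compare <- bd[bd$group %in% c("case2", "control"), ]
  tt      <- t.test(severity ~ group, data = compare)

  # Store this run in slot i instead of overwriting a single object
  res_list[[i]] <- data.frame(
    iter      = i,
    statistic = unname(tt$statistic),
    p.value   = tt$p.value,
    conf.low  = tt$conf.int[1],
    conf.high = tt$conf.int[2]
  )
}

results <- do.call(rbind, res_list)       # one row per iteration
```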

Sum total distance by groups

I have a df tracking movement of points each hour. I want to find the total distance traveled by that group/trial by adding the distance between the hourly coordinates, but I'm confusing myself with apply functions.
I want to say "in each group/trial, sum [distance(hour1-hour2), distance(hour2-hour3), distance(hour3-hour4), ...] up to the current hour", so that each line has a cumulative distance travelled value.
I've created a fake df below.
paths <- data.frame(matrix(nrow=80,ncol=5))
colnames(paths) <- c("trt","trial","hour","X","Y")
paths$trt <- rep(c("A","B","C","D"),each=20)
paths$trial <- rep(c(rep(1,times=10),rep(2,times=10)),times=4)
paths$hour <- rep(1:10,times=8)
paths[,4:5] <- runif(160,0,50)
library(ggplot2)
# this shows the paths that I want to measure
ggplot(data=paths,aes(x=X,y=Y,group=interaction(trt,trial),color=trt))+
geom_path()
I probably want to add a column paths$dist.traveled to keep track each hour.
I think I could use apply or maybe even aggregate, but I've been using pointDistance to find the distances, so I'm a bit confused. I would also rather not put a loop inside a loop, because the real dataset is large.
Here's an answer that uses {dplyr}:
library(dplyr)
paths %>%
arrange(trt, trial, hour) %>%
group_by(trt, trial) %>%
mutate(dist_travelled = sqrt((X - lag(X))^2 + (Y - lag(Y))^2)) %>%
mutate(total_dist = sum(dist_travelled, na.rm = TRUE)) %>%
ungroup()
If you wanted the total distance but grouped only by trt and not trial you would just remove that from the call to group_by().
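Since the question asks for a running total on each line and not one total per group, here is a small base-R sketch of the cumulative version. It assumes straight-line (Euclidean) distance between consecutive hours, as in the code above, and uses a smaller made-up paths table. The column names step_dist and cum_dist are illustrative. ave() applies a function within each trt/trial group, so no nested loop is needed.

```r
set.seed(1)
paths <- data.frame(
  trt   = rep(c("A", "B"), each = 6),
  trial = rep(rep(1:2, each = 3), times = 2),
  hour  = rep(1:3, times = 4),
  X     = runif(12, 0, 50),
  Y     = runif(12, 0, 50)
)

# Make sure hours are in order within each trt/trial
paths <- paths[order(paths$trt, paths$trial, paths$hour), ]
grp   <- interaction(paths$trt, paths$trial, drop = TRUE)

# Per-group differences to the previous hour (0 for the first hour)
dx <- ave(paths$X, grp, FUN = function(v) c(0, diff(v)))
dy <- ave(paths$Y, grp, FUN = function(v) c(0, diff(v)))

paths$step_dist <- sqrt(dx^2 + dy^2)                     # distance since previous hour
paths$cum_dist  <- ave(paths$step_dist, grp, FUN = cumsum)  # running total per group
```

The last cum_dist value in each group equals that group's total distance.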
Is this what you are trying to achieve?:
paths <- paths %>%
mutate(dist.traveled = sqrt((X-lag(X))^2 + (Y-lag(Y))^2))
paths
trt trial hour X Y dist.traveled
<chr> <dbl> <int> <dbl> <dbl> <dbl>
1 A 1 1 11.2 26.9 NA
2 A 1 2 20.1 1.48 27.0
3 A 1 3 30.4 0.601 10.4
4 A 1 4 31.1 26.6 26.0
5 A 1 5 38.1 30.4 7.88
6 A 1 6 27.9 47.9 20.2
7 A 1 7 16.5 35.3 16.9
8 A 1 8 0.328 13.0 27.6
9 A 1 9 14.0 41.7 31.8
10 A 1 10 29.7 7.27 37.8
# ... with 70 more rows
paths$dist.traveled[which(paths$hour==1)] <- NA # the first hour of each trial has no previous point
paths %>%
group_by(trt)%>%
summarise(total_distance = sum(dist.traveled, na.rm = TRUE))
trt total_distance
<chr> <dbl>
1 A 492.
2 B 508.
3 C 479.
4 D 462.
I am adding the new column to calculate the distances for each group, and then summing them up.

how to use map function in r to find the range and quantile

I first simulated 500 samples of size 55 in the normal distribution.
samples <- replicate(500, rnorm(55,mean=50, sd=10), simplify = FALSE)
1) For each sample, I want the mean, median, range, and third quartile. Then I need to store these together in a data frame.
This is what I have. I am not sure about the range or the quantile. I tried sapply and lapply but not sure how they work.
stats <- data.frame(
means = map_dbl(samples,mean),
medians = map_dbl(samples,median),
sd= map_dbl(samples,sd),
range= map_int(samples, max-min),
third_quantile=sapply(samples,quantile,type=3)
)
2) Then plot the sampling distribution (histogram) of the means.
I tried to plot this, but I don't understand how to get at the means:
stats <- gather(stats, key = "Trials", value = "Mean")
ggplot(stats,aes(x=Trials))+geom_histogram()
3) Then I want to plot the other three statistics in (three separate graphs) of a single plotting window.
I know I need to use something like gather and facet_wrap, but I am not sure how to do it.
You were almost there. All that is needed is to define anonymous functions wherever there are errors.
library(tidyverse)
set.seed(1234) # Make the results reproducible
samples <- replicate(500, rnorm(55,mean=50, sd=10), simplify = FALSE)
str(samples)
stats <- data.frame(
means = map_dbl(samples, mean),
medians = map_dbl(samples, median),
sd = map_dbl(samples, sd),
range = map_dbl(samples, function(x) diff(range(x))),
third_quantile = map_dbl(samples, function(x) quantile(x, probs = 3/4, type = 3))
)
str(stats)
#'data.frame': 500 obs. of 5 variables:
# $ means : num 49.8 51.5 52.2 50.2 51.6 ...
# $ medians : num 51.5 51.7 51 51.1 50.5 ...
# $ sd : num 9.55 7.81 11.43 8.97 10.75 ...
# $ range : num 38.5 37.2 54 36.7 60.2 ...
# $ third_quantile: num 57.7 56.2 58.8 55.6 57 ...
The map_dbl functions you're using are definitely nice, but if you're trying to get a data frame in the end anyway, you might have an easier time converting the list into a data frame at the beginning, then taking advantage of some dplyr functions.
I'm first mapping over the list, creating tibbles, and binding it together with an added ID. The conversion creates a column value of the sample values. summarise_at lets you take a list of functions—supplying names in the list sets the names in the resultant data frame. You can use purrr's ~. notation to define these functions inline where needed. Cuts down on the number of times you have to map_dbl and so on.
library(tidyverse)
stats <- samples %>%
map_dfr(as_tibble, .id = "sample") %>%
group_by(sample) %>%
summarise_at(vars(value),
.funs = list(mean = mean, median = median, sd = sd,
range = ~(max(.) - min(.)),
third_quartile = ~quantile(., probs = 0.75)))
head(stats)
#> # A tibble: 6 x 6
#> sample mean median sd range third_quartile
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 45.0 44.4 8.71 47.6 48.6
#> 2 10 51.0 52.0 9.55 49.3 56.2
#> 3 100 51.6 52.2 10.4 60.7 58.1
#> 4 101 51.6 51.1 9.92 37.6 57.2
#> 5 102 49.1 48.2 9.65 39.8 57.0
#> 6 103 52.2 51.3 10.1 47.4 58.5
Next, in your code you gathered the data—which is often the solution folks need on SO—but if you're only trying to show the mean column, you can work with it as is.
ggplot(stats, aes(x = mean)) +
geom_histogram()
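For part 3 (the other three statistics as three graphs in one plotting window), a base-graphics sketch is shown below as one option. It is not the gather/facet_wrap route mentioned in the question. It assumes the same simulation as above, recomputes the statistics with vapply so that it runs on its own, and uses par(mfrow = c(1, 3)) to split the window into three panels.

```r
set.seed(1234)
samples <- replicate(500, rnorm(55, mean = 50, sd = 10), simplify = FALSE)

stats <- data.frame(
  mean   = vapply(samples, mean, numeric(1)),
  median = vapply(samples, median, numeric(1)),
  range  = vapply(samples, function(x) diff(range(x)), numeric(1)),
  third_quartile = vapply(samples,
                          function(x) quantile(x, probs = 0.75, names = FALSE),
                          numeric(1))
)

# Three histograms side by side in a single plotting window
old_par <- par(mfrow = c(1, 3))
for (stat in c("median", "range", "third_quartile")) {
  hist(stats[[stat]], main = stat, xlab = stat)
}
par(old_par)   # restore the previous layout
```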

(R, dplyr) Put summarise_all data into a data frame of its own

Unfortunately I don't have enough points to make a comment, but I have a lingering question about the help in this post: (R, dplyr) select multiple columns starts with same string and summarise mean (90% CI) by group
Here's the working code from that post:
dat %>%
group_by(case) %>%
select(starts_with('ab')) %>%
summarise_all(funs('mean' = mean, 'ub' = quantile(., .95), 'lb' = quantile(., .05)))
# # A tibble: 3 x 7
# case abc_mean abe_mean abc_ub abe_ub abc_lb abe_lb
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 case1 1.5 2.0 1.95 2.90 1.05 1.10
# 2 case2 2.0 2.5 2.90 2.95 1.10 2.05
# 3 case3 2.0 4.0 2.00 4.00 2.00 4.00
Can I direct this output into a dataframe of its own? I'm having two issues. First, it's not printing all of my data. Instead it prints:
# A tibble: 2 x 13
STATION TMPMX1_mean TMPMX2_mean TMPMX3_mean TMPMX4_mean TMPMX5_mean TMPMX6_mean TMPMX7_mean
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 300220 -1.45 -0.13 5.45 12.8 19.3 24.0 26.2
2 303025 -1.77 -0.53 4.92 12.0 18.6 23.2 25.5
# ... with 5 more variables: TMPMX8_mean <dbl>, TMPMX9_mean <dbl>, TMPMX10_mean <dbl>,
# TMPMX11_mean <dbl>, TMPMX12_mean <dbl>
When I try to put this data into a dataframe, using the previous example it would look like this:
df<-dat %>%
group_by(case) %>%
select(starts_with('ab')) %>%
summarise_all(funs('mean' = mean))
But I get a "group_by" error saying "Adding missing grouping variables: 'case'"
So my questions are (1) how do I get summarise_all to show all data and (2) how do I route this data directly into a data frame.
Thank you!
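A hedged sketch of one way to handle both points, using made-up data in place of dat. As far as I know, "Adding missing grouping variables" is an informational message from select() and not an error, so the assignment to df should already have worked. The truncated display comes from how tibbles print, not from missing data. Converting with as.data.frame() gives a plain data frame that prints every column, and print(x, width = Inf) does the same for a tibble. The across() call below assumes dplyr 1.0 or later and replaces select() plus summarise_all(funs(...)).

```r
library(dplyr)

# Made-up data standing in for `dat` from the linked post
dat <- data.frame(
  case  = c("case1", "case1", "case2", "case2"),
  abc   = c(1, 2, 1, 3),
  abe   = c(1, 3, 2, 3),
  other = c(10, 20, 30, 40)
)

df <- dat %>%
  group_by(case) %>%
  summarise(across(starts_with("ab"), list(mean = mean))) %>%
  as.data.frame()   # plain data frame: print() shows every column

df

# To keep a tibble but still see every column when printing:
# print(tibble::as_tibble(df), width = Inf)
```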

How to reference "cells" within a column in R?

I'm trying to calculate numeric ranges based on the moving average of a column of data. I have found a way to use caTools::runmean to produce a column of moving averages, and I know how to work with this in Excel to produce the columns I want, but I would love to know a way to do all of this in one R script.
Here is my simplified reproducible example for R.
library(tidyverse)
library(caTools)
data <- as_tibble(data.frame(
Index = as.integer(c(18,19,21,22,23,25,26,29)),
mydbl = c(8.905,13.31,15.739,17.544,19.054,20.393,21.623,22.764)))
data <- data %>%
mutate(avg = runmean(mydbl,
k = 2,
alg = "exact",
endrule = "NA"))
This tibble will look like this:
> data
# A tibble: 8 x 3
Index mydbl avg
<int> <dbl> <dbl>
1 18 8.90 NA
2 19 13.3 11.1
3 21 15.7 14.5
4 22 17.5 16.6
5 23 19.1 18.3
6 25 20.4 19.7
7 26 21.6 21.0
8 29 22.8 22.2
To produce the remaining data I want, I exported this to Excel with write_csv(data,...) and the final table is shown below. The first value in dbl_i is the formula =B2-ABS(C3-B2) (the difference between mydbl and the next avg subtracted from mydbl to create an equidistant lower limit). The last value in dbl_f is the formula =B9+ABS(C9-B9) (the difference between mydbl and the avg added to mydbl to create an equidistant upper limit). The other values in the two columns are just direct references to the avg column.
Index mydbl avg dbl_i dbl_f
18 8.905 NA 6.7025 11.1075
19 13.31 11.1075 11.1075 14.5245
21 15.739 14.5245 14.5245 16.6415
22 17.544 16.6415 16.6415 18.299
23 19.054 18.299 18.299 19.7235
25 20.393 19.7235 19.7235 21.008
26 21.623 21.008 21.008 22.1935
29 22.764 22.1935 22.1935 23.3345
Yes, the dbl_i column is just the avg column but with the first value being =B2-ABS(C3-B2). And the dbl_f column is the same as the avg column except that it's moved up one, and the final value is =B9+ABS(C9-B9). Ultimately it seems the real problem lies in finding a way to reproduce the Excel calculations D2=B2-ABS(C3-B2) and E9=B9+ABS(C9-B9).
Does anyone know how they would reproduce these calculations in R? I was looking for a way to create a formula in R that could be the equivalent of B2-ABS(C3-B2), but could not find one, unless I create a matrix instead. Do I have to create a matrix?
Thanks for your time.
data %>%
mutate(
avg = zoo::rollmean(mydbl, 2, align="right", fill=NA),
dbl_i = if_else(row_number() == 1L, mydbl - abs(lead(avg) - mydbl), avg),
dbl_f = if_else(row_number() == n(), mydbl + abs(avg - mydbl), lead(avg))
)
# # A tibble: 8 x 5
# Index mydbl avg dbl_i dbl_f
# <int> <dbl> <dbl> <dbl> <dbl>
# 1 18 8.90 NA 6.70 11.1
# 2 19 13.3 11.1 11.1 14.5
# 3 21 15.7 14.5 14.5 16.6
# 4 22 17.5 16.6 16.6 18.3
# 5 23 19.1 18.3 18.3 19.7
# 6 25 20.4 19.7 19.7 21.0
# 7 26 21.6 21.0 21.0 22.2
# 8 29 22.8 22.2 22.2 23.3
Honestly it's not the most elegant, but it gets the job done.
(BTW: I'm using zoo::rollmean because I don't have caTools installed, but it's the same effect I believe.)
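On the "cells" part of the question: no matrix is needed, because a column in R is a vector and x[i] plays the role of a cell reference. The base-R sketch below reproduces the two Excel formulas with plain indexing. It assumes the same mydbl values as in the question and computes the two-point moving average by hand instead of calling caTools or zoo.

```r
mydbl <- c(8.905, 13.31, 15.739, 17.544, 19.054, 20.393, 21.623, 22.764)
n <- length(mydbl)

# Two-point moving average aligned to the later row (first value is NA)
avg <- c(NA, (mydbl[-n] + mydbl[-1]) / 2)

# dbl_i: same as avg, except the first "cell"
dbl_i <- avg
dbl_i[1] <- mydbl[1] - abs(avg[2] - mydbl[1])     # Excel: D2 = B2 - ABS(C3 - B2)

# dbl_f: avg shifted up one row, plus a computed last "cell"
dbl_f <- c(avg[-1],
           mydbl[n] + abs(avg[n] - mydbl[n]))     # Excel: E9 = B9 + ABS(C9 - B9)

result <- data.frame(mydbl, avg, dbl_i, dbl_f)
result
```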
