Calculating distance between all locations to first location, by group - r

I have GPS locations from several seabird tracks, each starting from colony x, so the individual tracks all have similar first locations. For each track, I would like to calculate the beeline distance between each GPS location and either (a) a specified location that represents colony x, or (b) the first GPS point of the track, which also represents colony x. For (b), I would like to use the first location of each new track ID (track_id, the ID column in the example data below).
I have looked for appropriate functions in geosphere, sp, raster, adehabitatLT, move, ... and just cannot seem to find what I am looking for.
I can calculate the distance between successive GPS points, but that is not what I need.
library(dplyr)
library(geosphere) # provides distVincentyEllipsoid() and distHaversine()
df %>%
group_by(ID) %>%
mutate(lat_prev = lag(Lat,1), lon_prev = lag(Lon,1) ) %>%
mutate(dist = distVincentyEllipsoid(matrix(c(lon_prev, lat_prev), ncol = 2), # or use distHaversine
matrix(c(Lon, Lat), ncol = 2)))
#example data:
df <- data.frame(Lon = c(-96.8, -96.60861, -96.86875, -96.14351, -92.82518, -90.86053, -90.14208, -84.64081, -83.7, -82, -80, -88.52732, -94.46049,-94.30, -88.60, -80.50, -81.70, -83.90, -84.60, -90.10, -90.80, -92.70, -96.10, -96.55, -96.50, -96.00),
Lat = c(25.38657, 25.90644, 26.57339, 27.63348, 29.03572, 28.16380, 28.21235, 26.71302, 25.12554, 24.50031, 24.89052, 30.16034, 29.34550, 29.34550, 30.16034, 24.89052, 24.50031, 25.12554, 26.71302, 28.21235, 28.16380, 29.03572, 27.63348, 26.57339, 25.80000, 25.30000),
ID = c(rep("ID1", 13), rep("ID2", 13)))
Grateful for any pointers.

You were pretty close. The key is that you want to calculate the distance from the first observation in each track. So you first need to add the order of each observation within its track (easy to do with dplyr::row_number()). Then, for the distance calculation, always make the first observation the reference by subsetting with order == 1.
library(tidyverse)
library(geosphere)
df <- data.frame(Lon = c(-96.8, -96.60861, -96.86875, -96.14351, -92.82518, -90.86053, -90.14208, -84.64081, -83.7, -82, -80, -88.52732, -94.46049,-94.30, -88.60, -80.50, -81.70, -83.90, -84.60, -90.10, -90.80, -92.70, -96.10, -96.55, -96.50, -96.00),
Lat = c(25.38657, 25.90644, 26.57339, 27.63348, 29.03572, 28.16380, 28.21235, 26.71302, 25.12554, 24.50031, 24.89052, 30.16034, 29.34550, 29.34550, 30.16034, 24.89052, 24.50031, 25.12554, 26.71302, 28.21235, 28.16380, 29.03572, 27.63348, 26.57339, 25.80000, 25.30000),
ID = c(rep("ID1", 13), rep("ID2", 13)))
df %>%
group_by(ID) %>%
mutate(order = row_number()) %>%
mutate(dist = distVincentyEllipsoid(matrix(c(Lon[order == 1], Lat[order == 1]), ncol = 2),
matrix(c(Lon, Lat), ncol = 2)))
#> # A tibble: 26 x 5
#> # Groups: ID [2]
#> Lon Lat ID order dist
#> <dbl> <dbl> <chr> <int> <dbl>
#> 1 -96.8 25.4 ID1 1 0
#> 2 -96.6 25.9 ID1 2 60714.
#> 3 -96.9 26.6 ID1 3 131665.
#> 4 -96.1 27.6 ID1 4 257404.
#> 5 -92.8 29.0 ID1 5 564320.
#> 6 -90.9 28.2 ID1 6 665898.
#> 7 -90.1 28.2 ID1 7 732131.
#> 8 -84.6 26.7 ID1 8 1225193.
#> 9 -83.7 25.1 ID1 9 1319482.
#> 10 -82 24.5 ID1 10 1497199.
#> # ... with 16 more rows
Created on 2022-01-09 by the reprex package (v2.0.1)

This also seems to work (sent to me by a friend) - very similar to Dan's suggestion above, but slightly different
library(geosphere)
library(dplyr)
df %>%
group_by(ID) %>%
mutate(Dist_to_col = distHaversine(c(Lon[1], Lat[1]),cbind(Lon,Lat)))
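Neither answer covers option (a) from the question, the distance to a fixed colony position. Below is a minimal base-R sketch of both options. It uses a hand-written haversine function (haversine_m, a name made up here) instead of geosphere so that it runs without extra packages, and the colony coordinates are a hypothetical placeholder that simply reuses the first fix of ID1.

```r
# Haversine great-circle distance in metres on a sphere.
# r = 6378137 m is an assumed Earth radius (the default used by geosphere::distHaversine).
haversine_m <- function(lon1, lat1, lon2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# A few fixes taken from the example data, two tracks
tracks <- data.frame(
  ID  = c("ID1", "ID1", "ID1", "ID2", "ID2"),
  Lon = c(-96.8, -96.60861, -96.86875, -94.30, -88.60),
  Lat = c(25.38657, 25.90644, 26.57339, 29.34550, 30.16034)
)

# (a) distance to one fixed colony position (hypothetical coordinates)
colony <- c(lon = -96.8, lat = 25.38657)
tracks$dist_to_colony <- haversine_m(colony[["lon"]], colony[["lat"]],
                                     tracks$Lon, tracks$Lat)

# (b) distance to the first fix of each track:
# ave() repeats the first Lon/Lat of every ID along that ID's rows
first_lon <- ave(tracks$Lon, tracks$ID, FUN = function(x) x[1])
first_lat <- ave(tracks$Lat, tracks$ID, FUN = function(x) x[1])
tracks$dist_to_first <- haversine_m(first_lon, first_lat,
                                    tracks$Lon, tracks$Lat)
```

With geosphere installed, option (a) should reduce to passing the fixed c(lon, lat) pair as the first argument of distHaversine() inside the same mutate() call.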

Related

R: Comparing Subgroups From Different Datasets

I am working with the R programming language.
I have the following dataset that contains the heights and weights of people from Canada. I split weight (kg) into 100 quantile bins (ntiles) and calculated the average height (cm) within each bin:
library(dplyr)
library(gtools)
set.seed(123)
canada = data.frame(height = rnorm(10000,150,10), weight = rnorm(10000,90, 10))
Part_1 = canada %>%
mutate(quants = quantcut(weight, 100),
rank = as.numeric(quants)) %>%
group_by(quants) %>%
mutate(min = min(weight), max = max(weight), count = n(), avg_height = mean(height))
Part_1 = Part_1 %>% distinct(rank, .keep_all = TRUE)
> Part_1
# A tibble: 100 x 8
# Groups: quants [100]
height weight quants rank min max count avg_height
<dbl> <dbl> <fct> <dbl> <dbl> <dbl> <int> <dbl>
1 144. 114. (110.2,113.9] 99 110. 114. 100 150.
2 148. 88.3 (88.12,88.38] 44 88.1 88.4 100 149.
3 166. 99.3 (99.1,99.52] 83 99.1 99.5 100 152.
4 151. 84.3 (84.14,84.44] 29 84.1 84.4 100 150.
For example, I see that there are 100 people in the weight range 110.2 - 113.9 kg, and the average height of these people is 150 cm.
Now, suppose I have a similar dataset for people from the USA:
set.seed(124)
usa = data.frame(height = rnorm(10000,150,10), weight = rnorm(10000,90, 10))
My Question: Based on the weight ranges I calculated using the Canada dataset, I want to find out how many people from the USA fall within these Canadian ranges and what the average height of the Americans within each range is.
For example:
In the Canada dataset, I saw that there are 100 people in the weight range 110.2 - 113.9 kg, and the average height of these people is 150 cm.
How many Americans are in the weight range 110.2 - 113.9 kg, and what is the average height of these Americans?
I know that I can do this manually for each rank:
americans_in_canadian_rank99 = usa %>%
filter(weight > 110.2 & weight < 113.9) %>%
group_by() %>%
summarize(count = n(), avg_height = mean(height))
americans_in_canadian_rank44 = usa %>%
filter(weight > 88.1 & weight < 88.4) %>%
group_by() %>%
summarize(count = n(), avg_height = mean(height))
In the end, I am looking for output like this:
# number of rows should be = number of unique ranks
canadian_rank min_weight max_weight canadian_count canadian_avg_height american_count american_avg_height
1 99 110.2 113.9 100 150 116 150
2 44 88.1 88.4 100 149 154 150
Can someone please help me figure out a better way to do this?
Thanks!
Note: updated based on the desired output format combining the two sets:
This can be done in a straight-forward manner using the non-equijoin functionality of data.table.
library(data.table)
library(gtools)
set.seed(123)
canada = data.table(height = rnorm(10000,150,10), weight = rnorm(10000,90, 10))
set.seed(124)
usa = data.table(height = rnorm(10000,150,10), weight = rnorm(10000,90, 10))
## You can also use data.table to generate your Part_1 summary table
Part_1 <- canada[, .(min = min(weight),
max = max(weight),
count = .N,
avg_height = mean(height)), keyby = .(quants = quantcut(weight,100))]
Part_1[, rank := as.numeric(quants)]
## Join using a non-equi join to combine data sets
usa[Part_1, on = .(weight >= min,
weight < max)
## On the join result, compute same summary stats by quants & rank
][, .(usa_count = .N,
usa_avg_height = mean(height)), keyby = .(rank,
quants,
## whenever we do a non-equijoin, the foreign key values, in this case min/max
## overwrite the local keys. Since we used weight twice, canadian min/max
## will show up in the join result table as weight and weight.1
min_weight = weight,
max_weight = weight.1,
## To keep both sets of results distinct, we can rename columns in our "by" statement
canadian_count = count,
canadian_avg_height = avg_height)]
Gives results as follows:
rank quants min_weight max_weight canadian_count canadian_avg_height usa_count usa_avg_height
1: 1 [55.11,66.71] 55.11266 66.69011 100 149.2101 114 149.8116
2: 2 (66.71,69.48] 66.70575 69.46055 100 149.0639 119 148.6486
3: 3 (69.48,71.15] 69.48011 71.13895 100 150.5331 94 148.4336
4: 4 (71.15,72.44] 71.14747 72.43042 100 150.4779 104 149.8926
Also, another option would be to assign result columns for the usa table directly back to your Part_1 summary table in place.
## This is a two-part nested join
Part_1[
## Start by creating a result that matches Part_1 ranks to all usa data
Part_1[usa,on = .(min <= weight,
max > weight)
## Compute aggregated results on the join table result
][,.(usa_count = .N,
usa_avg_height = mean(height)), by = .(rank)],
## Finally, assign results back to the Part_1 summary table joined by rank
c("usa_count",
"usa_avg_height") := .(usa_count,
usa_avg_height), on = .(rank)]
Gives the following
quants min max count avg_height rank usa_count usa_avg_height
1: [55.11,66.71] 55.11266 66.69011 100 149.2101 1 114 149.8116
2: (66.71,69.48] 66.70575 69.46055 100 149.0639 2 119 148.6486
3: (69.48,71.15] 69.48011 71.13895 100 150.5331 3 94 148.4336
4: (71.15,72.44] 71.14747 72.43042 100 150.4779 4 104 149.8926
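To see what the range join is computing, the same count-and-average logic can be written out by hand for a few bins. This is only a base-R sketch on made-up toy data (bins and usa_toy are hypothetical names and values), not the data.table method itself. Each bin is treated as half-open, [min, max), matching the join conditions above.

```r
# Three toy weight bins standing in for the Canadian quantile ranges
bins <- data.frame(rank = 1:3,
                   min  = c(50, 60, 70),
                   max  = c(60, 70, 80))

# Toy "usa" rows; the last weight (95) falls outside every bin
usa_toy <- data.frame(weight = c(52, 58, 61, 69, 75, 95),
                      height = c(150, 160, 155, 165, 170, 180))

# For each bin, flag the rows whose weight falls inside it, then summarise
per_bin <- lapply(seq_len(nrow(bins)), function(i) {
  inside <- usa_toy$weight >= bins$min[i] & usa_toy$weight < bins$max[i]
  data.frame(rank           = bins$rank[i],
             usa_count      = sum(inside),
             usa_avg_height = mean(usa_toy$height[inside]))
})
usa_summary <- do.call(rbind, per_bin)

# Attach the summaries to the bin table by rank
result <- merge(bins, usa_summary, by = "rank")
```

The explicit loop compares every bin against every row, so it is only meant for checking a handful of bins; the non-equi join does the same matching far more efficiently.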
With data.table you can do this:
library(data.table)
library(stringr)
dt1 <- as.data.table(usa)
dt1 <- dt1[, c("min", "max") := weight]
dt2 <- as.data.table(Part_1 %>% select("quants", "rank"))
dt2 <- cbind(dt2[,.(rank)],
setDT(tstrsplit(str_sub(dt2$quants, 2, -2), ",", fixed = TRUE, names = c("min", "max"))))
dt2 <- dt2[, lapply(.SD, as.numeric)]
setkey(dt2, min, max)
dt1 <- dt1[, rank := dt2$rank[foverlaps(dt1, dt2, by.x = c("min", "max"), by.y = c("min", "max"), which = TRUE)$yid]] %>%
select(-c("min", "max"))
EDIT
Totally missed the last part. But if you wish to do that, it should be relatively straightforward from the last point (you could use dplyr for that if you wish):
dt3 <- rbind(canada %>%
mutate(quants = quantcut(weight, 100),
rank = as.numeric(quants),
country = "Canada") %>%
as.data.table(),
copy(dt1)[, country := "USA"], fill = TRUE)
dt3 <- dt3[,.(count = .N, avg_height = mean(height)), by = c("rank", "country")] %>%
dcast(rank ~ country, value.var = c("count", "avg_height")) %>%
merge(dt2 %>% rename("min_weight" = "min", "max_weight" = "max"), by = c("rank"), all.x = TRUE)
EDIT 2
Alternatively, you could do something similar with the cut function, which does not require learning data.table:
rank_breaks <- Part_1 %>%
mutate(breaks = sub(",.*", "", str_sub(quants, 2)) %>% as.numeric()) %>%
arrange(rank) %>%
pull(breaks)
# Here I change minimum and maximum of groups 1 and 100 to -Inf and Inf respectively.
# If you do not wish to do so, you can disregard it and run `rank_breaks <- c(rank_breaks, max(canada$weight))` instead
rank_breaks[1] <- -Inf
rank_breaks <- c(rank_breaks, Inf)
usa <- usa %>%
mutate(rank = cut(weight, breaks = rank_breaks, labels = c(1:100)))
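The cut() step only assigns a rank to each row; the counts and average heights still have to be computed afterwards. A minimal base-R sketch of both steps on made-up toy values (the breaks and usa_toy below are hypothetical, not the real quantile breaks):

```r
# Toy breaks with open-ended first and last bins, as in the answer above
toy_breaks <- c(-Inf, 60, 70, Inf)

usa_toy <- data.frame(weight = c(52, 60, 61, 75, 95),
                      height = c(150, 160, 155, 170, 180))

# cut() uses right-closed intervals by default, so weight 60 lands in bin 1
usa_toy$rank <- cut(usa_toy$weight, breaks = toy_breaks, labels = 1:3)

# Count and average height per assigned rank
counts <- table(usa_toy$rank)
avgs   <- tapply(usa_toy$height, usa_toy$rank, mean)

usa_summary <- data.frame(rank           = names(counts),
                          usa_count      = as.vector(counts),
                          usa_avg_height = as.vector(avgs))
```

Note that cut() returns a factor, so rank would need converting with as.numeric(as.character(rank)) before joining it to a numeric rank column.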
You can use fuzzyjoin for this.
library(fuzzyjoin)
# take percentile ranges and join US data
us_by_canadian_quantiles <- Part_1 |>
ungroup() |>
distinct(rank, min, max, height_avg_can = avg_height) |>
fuzzy_full_join(usa, by = c(min = "weight", max = "weight"), match_fun = c(`<`, `>=`))
# get count and height average per bin
us_by_canadian_quantiles |>
group_by(rank) |>
summarize(n_us = n(),
height_avg_us = mean(height),
height_avg_can = first(height_avg_can)
)
#> # A tibble: 101 × 4
#> rank n_us height_avg_us height_avg_can
#> <dbl> <int> <dbl> <dbl>
#> 1 1 114 150. 149.
#> 2 2 119 149. 149.
#> 3 3 94 148. 151.
#> 4 4 104 150. 150.
#> 5 5 115 152. 150.
#> 6 6 88 150. 149.
#> 7 7 86 150. 150.
#> 8 8 86 150. 151.
#> 9 9 102 151. 151.
#> 10 10 81 152. 150.
#> # … with 91 more rows
Note that there are a number of cases in the US frame which fall outside of the Canadian percentile ranges. They are grouped together here with rank being NA, but you could also add ranks 0 and 101 if you wanted to distinguish them.
I should note that fuzzyjoin tends to be much slower than data.table. But since you have already gotten a data.table solution, this might be more to your liking.

Aggregating a dataframe with dplyr in R based on several dummy variables

I am using dplyr to aggregate my dataframe, so it shows percentages of people choosing specific protein design tasks by company size. I have different dummy variables for protein design tasks, because this was a multiple choice question in a survey.
I figured out a way to do this, but my code is very long, because I aggregate the data per task and then join all these separate dataframes together into one. I’m curious whether there is a more elegant (shorter) way to do this?
library(tidyverse)
EarlyAccess <- read_csv("https://dropbox.com/s/antzwk1jh4ldrhi/EarlyAccess_anon.csv?dl=1")
#################### STABILITY ################################################
Proportions_tasks_stability <- EarlyAccess %>%
select(size, Improving.stability..generic..thermal..pH.) %>%
group_by(size, Improving.stability..generic..thermal..pH.) %>%
summarise(count_var_stability=n())%>%
mutate(total_group_by_size = sum(count_var_stability)) %>%
mutate(pc_var_stability=count_var_stability/sum(count_var_stability)*100) %>%
filter(Improving.stability..generic..thermal..pH.=="Improving stability (generic, thermal, pH)") %>%
select(size, Improving.stability..generic..thermal..pH., pc_var_stability)
######################## ACTIVITY #############################################
Proportions_tasks_activity <- EarlyAccess %>%
select(size, Improving.activity ) %>%
group_by(size, Improving.activity) %>%
summarise(count_var_activity=n())%>%
mutate(total_group_by_size = sum(count_var_activity)) %>%
mutate(pc_var_activity=count_var_activity/sum(count_var_activity)*100) %>%
filter(Improving.activity=="Improving activity") %>%
select(size, Improving.activity, pc_var_activity)
######################## BINDING AFFINITY ######################################
Proportions_tasks_binding.affinity<- EarlyAccess %>%
select(size, Improving.binding.affinity ) %>%
group_by(size, Improving.binding.affinity) %>%
summarise(count_var_binding.affinity=n())%>%
mutate(total_group_by_size = sum(count_var_binding.affinity)) %>%
mutate(pc_var_binding.affinity=count_var_binding.affinity/sum(count_var_binding.affinity)*100) %>%
filter(Improving.binding.affinity=="Improving binding affinity") %>%
select(size, Improving.binding.affinity, pc_var_binding.affinity)
# Then join them
Protein_design_tasks <- Proportions_tasks_stability %>%
inner_join(Proportions_tasks_activity, by = "size") %>%
inner_join(Proportions_tasks_binding.affinity, by = "size")
Using the datafile you provided, this should give the percentages of the selected category within each column for each size:
library(tidyverse)
df <-
read_csv("https://dropbox.com/s/antzwk1jh4ldrhi/EarlyAccess_anon.csv?dl=1")
df |>
group_by(size) |>
summarise(
pc_var_stability = sum(
Improving.stability..generic..thermal..pH. == "Improving stability (generic, thermal, pH)",
na.rm = TRUE
) / n() * 100,
pc_var_activity = sum(Improving.activity == "Improving activity",
na.rm = TRUE) / n() * 100,
pc_var_binding.affinity = sum(
Improving.binding.affinity == "Improving binding affinity",
na.rm = TRUE
) / n() * 100
)
#> # A tibble: 7 × 4
#> size pc_var_stability pc_var_activity pc_var_binding.affinity
#> <chr> <dbl> <dbl> <dbl>
#> 1 1000-10000 43.5 47.8 34.8
#> 2 10000+ 65 65 70
#> 3 11-50 53.8 53.8 46.2
#> 4 2-10 51.1 46.8 46.8
#> 5 200-1000 64.7 52.9 52.9
#> 6 50-200 42.1 42.1 36.8
#> 7 Just me 48.5 39.4 54.5
Looking at your data, each column has either the string value you're testing for or NA, so you could make it even shorter/tidier just by counting non-NAs in relevant columns:
df |>
group_by(size) |>
summarise(across(
c(
Improving.stability..generic..thermal..pH.,
Improving.activity,
Improving.binding.affinity
),
\(val) 100 * sum(!is.na(val)) / n()
))
If what you're aiming to do is summarise across all columns then the latter method may work best - there are several ways of specifying which columns you want and so you don't necessarily need to type all names and values in. You might also find it clearest to make calculating and formatting all percentages a named function to call:
library(tidyverse)
df <-
read_csv("https://dropbox.com/s/antzwk1jh4ldrhi/EarlyAccess_anon.csv?dl=1",
show_col_types = FALSE)
perc_nonmissing <- function(val) {
sprintf("%.1f%%", 100 * sum(!is.na(val)) / n())
}
df |>
group_by(size) |>
summarise(across(-c(1:2), perc_nonmissing))
#> # A tibble: 7 × 12
#> size Disco…¹ Searc…² Under…³ Impro…⁴ Impro…⁵ Impro…⁶ Impro…⁷ Impro…⁸ Impro…⁹
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1000-… 21.7% 17.4% 43.5% 47.8% 39.1% 43.5% 30.4% 39.1% 39.1%
#> 2 10000+ 40.0% 55.0% 55.0% 65.0% 70.0% 65.0% 20.0% 30.0% 40.0%
#> 3 11-50 30.8% 26.9% 42.3% 53.8% 38.5% 53.8% 15.4% 30.8% 38.5%
#> 4 2-10 38.3% 40.4% 48.9% 46.8% 36.2% 51.1% 23.4% 31.9% 42.6%
# etc.
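The non-NA counting idea does not depend on dplyr. As a rough illustration, here is the same calculation in base R on a made-up survey (the survey data frame, its task_a/task_b columns, and pct_selected are hypothetical stand-ins, not the real dataset):

```r
# Toy multiple-choice data: a column holds the option label if ticked, NA otherwise
survey <- data.frame(
  size   = c("2-10", "2-10", "11-50", "11-50", "11-50"),
  task_a = c("A", NA, "A", "A", NA),
  task_b = c(NA, NA, "B", NA, NA)
)

# Percentage of respondents who ticked the option (i.e. non-missing values)
pct_selected <- function(x) 100 * mean(!is.na(x))

# Apply the function to every task column within each company size
pct_by_size <- aggregate(survey[c("task_a", "task_b")],
                         by  = list(size = survey$size),
                         FUN = pct_selected)
```

Here mean(!is.na(x)) equals sum(!is.na(x)) / n() from the dplyr version, because the mean of a logical vector is the share of TRUE values.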

Pivot_longer and Pivot wider syntax

I would like ideas on how to write the pivot_longer syntax for this data.
I have already searched the internet, but I can't find any example similar to my data, where each Metric column is also separated into 3 different month columns.
My desired final output has seven columns (region, month, and the five metrics).
How should I formulate the pivot_longer and pivot_wider syntax to clean my data so that I can visualize it?
The tricky part isn't pivot_longer. You first have to clean your Excel spreadsheet, i.e. get rid of empty rows and merge the two header rows containing the names of the variables and the dates.
One approach to achieve your desired result may look like so:
library(readxl)
library(tidyr)
library(janitor)
library(dplyr)
x <- read_excel("data/Employment.xlsx", skip = 3, col_names = FALSE) %>%
# Get rid of empty rows and cols
janitor::remove_empty()
# Make column names
col_names <- data.frame(t(x[1:2,])) %>%
fill(1) %>%
unite(name, 1:2, na.rm = TRUE) %>%
pull(name)
x <- x[-c(1:2),]
names(x) <- col_names
# Convert to long and values to numerics
x %>%
pivot_longer(-Region, names_to = c(".value", "months"), names_sep = "_") %>%
separate(months, into = c("month", "year")) %>%
mutate(across(!c(Region, month, year), as.numeric))
#> # A tibble: 6 × 8
#> Region month year `Total Population … `Labor Force Part… `Employment Rat…
#> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 Philippin… April 2020f 73722. 55.7 82.4
#> 2 Philippin… Janu… 2021p 74733. 60.5 91.3
#> 3 Philippin… April 2021p 74971. 63.2 91.3
#> 4 National … April 2020f 9944. 54.2 87.7
#> 5 National … Janu… 2021p 10051. 57.2 91.2
#> 6 National … April 2021p 10084. 60.1 85.6
#> # … with 2 more variables: Unemployment Rate <dbl>, Underemployment Rate <dbl>
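The header-merging step is the part that is easiest to get wrong, so here is a small base-R sketch of just that step on a made-up two-row header. The metric names MetricA and MetricB and the fill_forward helper are hypothetical; the real spreadsheet has five metrics.

```r
# Two header rows as they might come out of the spreadsheet:
# the metric name appears once, followed by blanks for its other month columns
top    <- c("Region", "MetricA", NA, "MetricB", NA)
bottom <- c(NA, "April 2020", "January 2021", "April 2020", "January 2021")

# Carry the last non-missing value forward (assumes the first entry is not NA)
fill_forward <- function(x) {
  last_seen <- cumsum(!is.na(x))  # index of the most recent non-NA value
  x[!is.na(x)][last_seen]
}
top_filled <- fill_forward(top)

# Join metric and month with "_" and keep single-row headers as they are
col_names <- ifelse(is.na(bottom), top_filled,
                    paste(top_filled, bottom, sep = "_"))

# Splitting on "_" recovers the two pieces that names_sep = "_" relies on
parts  <- strsplit(col_names[-1], "_", fixed = TRUE)
metric <- vapply(parts, `[`, character(1), 1)
month  <- vapply(parts, `[`, character(1), 2)
```

The part before the underscore becomes the output column (".value") and the part after it becomes the months column, which is why the separator must not occur inside the metric names themselves.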

How can I delineate multiple watersheds in R using the streamstats package?

There is an R package in development that I would like to use called streamstats. What it does is delineate a watershed (within the USA) for a latitude & longitude point along a body of water and provides watershed characteristics such as drainage area and proportions of various land covers. What I would like to do is extract some watershed characteristics of interest from a data frame of several lat & long positions.
I can get the package to do what I want for one point
devtools::install_github("markwh/streamstats")
library(streamstats)
setTimeout(120)
dat1 <- data.frame(matrix(ncol = 3, nrow = 3))
x <- c("state","lat","long")
colnames(dat1) <- x
dat1$state <- c("NJ","NY","VA")
dat1$lat <- c(40.99194,42.02458,38.04235)
dat1$long <- c(-74.28000,-75.11928,-79.88144)
test_dat <- dat1[1,]
ws1 <- delineateWatershed(xlocation = test_dat$long, ylocation = test_dat$lat, crs = 4326,
includeparameters = "true", includeflowtypes = "true")
chars1 <- computeChars(workspaceID = ws1$workspaceID, rcode = "MA")
chars1$parameters
However what I would like is to be able to give the delineateWatershed function several watersheds at once (i.e., all 3 locations found in dat1) and combine the chars1$parameters output variables DRNAREA,FOREST,LC11DEV, and LC11IMP into a data frame. Maybe this could be achieved with a for loop?
The ideal output would look like this
state lat long DRNAREA FOREST LC11DEV LC11IMP
1 NJ 40.99194 -74.28000 160 66.2 26.20 5.50
2 NY 42.02458 -75.11928 457 89.3 2.52 0.18
3 VA 38.04235 -79.88144 158 NA 4.63 0.20
I would put what you have in a function then use purrr::pmap_df() to loop through each row in dat1 then bind all the results together. See also this answer
library(dplyr)
library(purrr)
library(tidyr)
library(streamstats)
setTimeout(120)
dat1 <- data.frame(matrix(ncol = 3, nrow = 2))
colnames(dat1) <- c("state", "lat", "long")
dat1$state <- c("NJ", "NY")
dat1$lat <- c(40.99194, 42.02458)
dat1$long <- c(-74.28000, -75.11928)
dat1
#> state lat long
#> 1 NJ 40.99194 -74.28000
#> 2 NY 42.02458 -75.11928
Define a function for catchment delineation
catchment_delineation <- function(rcode_in, lat_y, long_x) {
print(paste0("Processing for lat = ", lat_y, " and long = ", long_x))
ws <- delineateWatershed(xlocation = long_x, ylocation = lat_y, crs = 4326,
includeparameters = "true", includeflowtypes = "true")
ws_properties <- computeChars(workspaceID = ws$workspaceID, rcode = rcode_in)
# keep only what we need
ws_properties_df <- ws_properties$parameters %>%
filter(code %in% c("DRNAREA", "FOREST", "LC11DEV", "LC11IMP")) %>%
mutate(ID = ws$workspaceID,
state = rcode_in,
long = long_x,
lat = lat_y)
return(ws_properties_df)
}
Apply the function to each row in dat1 data frame
catchment_df <- pmap_df(dat1, ~ catchment_delineation(..1, ..2, ..3))
#> https://streamstats.usgs.gov/streamstatsservices/watershed.geojson?rcode=NJ&xlocation=-74.28&ylocation=40.99194&includeparameters=true&includeflowtypes=true&includefeatures=true&crs=4326
#> https://streamstats.usgs.gov/streamstatsservices/parameters.json?rcode=NJ&workspaceID=NJ20210923064141811000&includeparameters=true
#> https://streamstats.usgs.gov/streamstatsservices/watershed.geojson?rcode=NY&xlocation=-75.11928&ylocation=42.02458&includeparameters=true&includeflowtypes=true&includefeatures=true&crs=4326
#> https://streamstats.usgs.gov/streamstatsservices/parameters.json?rcode=NY&workspaceID=NY20210923064248530000&includeparameters=true
catchment_df
#> ID name
#> 1 NJ20210923064141811000 Drainage Area
#> 2 NJ20210923064141811000 Percent Forest
#> 3 NJ20210923064141811000 Percent Developed from NLCD2011
#> 4 NJ20210923064141811000 Percent_Impervious_NLCD2011
#> 5 NY20210923064248530000 Drainage Area
#> 6 NY20210923064248530000 Percent Forest
#> 7 NY20210923064248530000 Percent Developed from NLCD2011
#> 8 NY20210923064248530000 Percent_Impervious_NLCD2011
#> description
#> 1 Area that drains to a point on a stream
#> 2 Percentage of area covered by forest
#> 3 Percentage of developed (urban) land from NLCD 2011 classes 21-24
#> 4 Average percentage of impervious area determined from NLCD 2011 impervious dataset
#> 5 Area that drains to a point on a stream
#> 6 Percentage of area covered by forest
#> 7 Percentage of developed (urban) land from NLCD 2011 classes 21-24
#> 8 Average percentage of impervious area determined from NLCD 2011 impervious dataset
#> code unit value state long lat
#> 1 DRNAREA square miles 160.00 NJ -74.28000 40.99194
#> 2 FOREST percent 66.20 NJ -74.28000 40.99194
#> 3 LC11DEV percent 26.20 NJ -74.28000 40.99194
#> 4 LC11IMP percent 5.50 NJ -74.28000 40.99194
#> 5 DRNAREA square miles 457.00 NY -75.11928 42.02458
#> 6 FOREST percent 89.30 NY -75.11928 42.02458
#> 7 LC11DEV percent 2.52 NY -75.11928 42.02458
#> 8 LC11IMP percent 0.18 NY -75.11928 42.02458
Reshape the result to desired format
catchment_reshape <- catchment_df %>%
select(state, long, lat, code, value) %>%
pivot_wider(names_from = code,
values_from = value)
catchment_reshape
#> # A tibble: 2 x 7
#> state long lat DRNAREA FOREST LC11DEV LC11IMP
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 NJ -74.3 41.0 160 66.2 26.2 5.5
#> 2 NY -75.1 42.0 457 89.3 2.52 0.18
Created on 2021-09-22 by the reprex package (v2.0.1)
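The row-wise pattern itself can be tried without calling the StreamStats service. In the sketch below, fake_watershed_chars is a made-up stand-in for the delineateWatershed() plus computeChars() pair and returns invented values; only the apply, bind and reshape structure carries over.

```r
# Hypothetical stub that mimics the long-format output for one location
fake_watershed_chars <- function(state, lat, long) {
  data.frame(state = state, lat = lat, long = long,
             code  = c("DRNAREA", "FOREST"),
             value = c(abs(long), lat))  # invented numbers, not real characteristics
}

dat <- data.frame(state = c("NJ", "NY"),
                  lat   = c(40.99194, 42.02458),
                  long  = c(-74.28, -75.11928))

# Call the function once per row and stack the results (base-R analogue of pmap_df)
long_df <- do.call(rbind, Map(fake_watershed_chars,
                              dat$state, dat$lat, dat$long))

# One row per location, one column per characteristic (analogue of pivot_wider)
wide_df <- reshape(long_df,
                   idvar = c("state", "lat", "long"),
                   timevar = "code", direction = "wide")
names(wide_df) <- sub("^value\\.", "", names(wide_df))
```

Because each location triggers a slow web request in the real code, it may be worth testing the pipeline with a stub like this before running it over many points.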
Since you mentioned using a for loop, I thought I would write a solution that way.
Here is your data:
library(dplyr)
library(purrr)
library(tidyr)
library(streamstats)
setTimeout(120)
dat1 <- data.frame(matrix(ncol = 3, nrow = 2))
colnames(dat1) <- c("state", "lat", "long")
dat1$state <- c("NJ", "NY")
dat1$lat <- c(40.99194, 42.02458)
dat1$long <- c(-74.28000, -75.11928)
dat1
Create an empty list to store the watershed characteristics:
water_shed <- list()
Loop through the dat1 and return the properties for each respective longitude and latitude:
for(i in 1:nrow(dat1)){
water_shed[[i]] <-
delineateWatershed(xlocation = dat1$long[i], ylocation = dat1$lat[i], crs = 4326,
includeparameters = "true", includeflowtypes = "true")
}
Now create a list to store the watershed properties:
ws_properties <- list()
Loop through the water_shed returning the parameters of each location:
for(i in 1:length(water_shed)){
# use the state of the current location, not the whole state vector
ws_properties[[i]] <- computeChars(workspaceID = water_shed[[i]]$workspaceID, rcode = dat1$state[i])
}
Finally, create a dataframe for your desired outputs then append the properties for each location looping through the list of watershed properties:
# data frame:
ws_properties_df <- data.frame(state=character(),long=integer(), lat=integer(),
DRNAREA = integer(), FOREST = integer(), LC11DEV = integer(), LC11IMP = integer(),
stringsAsFactors=FALSE)
#append properties for eact location
for(i in 1:length(ws_properties)){
ws_properties_df[i,] <- ws_properties[[i]]$parameters %>%
filter(code %in% c("DRNAREA", "FOREST", "LC11DEV", "LC11IMP")) %>%
mutate(state = dat1$state[i],
long = dat1$long[i],
lat = dat1$lat[i]) %>%
select(state, long, lat, code, value) %>%
pivot_wider(names_from = code,
values_from = value)
}
Desired Output:

Translating a `for loop` into purrr using `map` function in r

I need to download weather data from NASA’s POWER (Prediction Of Worldwide Energy Resource) project. The nasapower package was developed for retrieving this data from R. I need to download data for many locations (lat, long coordinates). To do this, I tried a simple loop with three locations as a reproducible example.
library(nasapower)
data1 <- read.csv(text = "
location,long,lat
loc1, -56.547, -14.2427
loc2, -57.547, -15.2427
loc3, -58.547, -16.2427")
i=1
all.weather <- data.frame()
for (i in seq_len(nrow(data1))) {
weather.data <- get_power(community = "AG",
lonlat = c(data1$long[i],data1$lat[i]),
dates = c("2015-01-01", "2015-01-10"),
temporal_average = "DAILY",
pars = c("T2M_MAX"))
all.weather <-rbind(all.weather, weather.data)
}
This works perfectly. The problem is that I am trying to mimic this using purrr::map, since I want to have an alternative within the tidyverse. This is what I did, but it does not work:
library(dplyr)
library(purrr)
all.weather <- data1 %>%
group_by(location) %>%
map(get_power(community = "AG",
lonlat = c(long, lat),
dates = c("2015-01-01", "2015-01-10"),
temporal_average = "DAILY",
site_elevation = NULL,
pars = c("T2M_MAX")))
I got the following error:
Error in isFALSE(length(lonlat != 2)) : object 'long' not found
Any hint on how to run this using purrr?
To make your code work, use purrr::pmap instead of map, as in the code below.
map is for one-argument functions, map2 is for two-argument functions, and pmap is the most general one, allowing functions with any number of arguments.
pmap loops over the rows of your df. As your df has 3 columns, 3 arguments are passed to the function, even though the first argument, location, is not used. To make this work and to refer to the columns by name, specify the function and its argument names via function(location, long, lat).
library(nasapower)
data1 <- read.csv(text = "
location,long,lat
loc1, -56.547, -14.2427
loc2, -57.547, -15.2427
loc3, -58.547, -16.2427")
library(dplyr)
library(purrr)
all.weather <- data1 %>%
pmap(function(location, long, lat) get_power(community = "AG",
lonlat = c(long, lat),
dates = c("2015-01-01", "2015-01-10"),
temporal_average = "DAILY",
site_elevation = NULL,
pars = c("T2M_MAX"))) %>%
# Name list with locations
setNames(data1$location) %>%
# Add location names as identifiers
bind_rows(.id = "location")
head(all.weather)
#> NASA/POWER SRB/FLASHFlux/MERRA2/GEOS 5.12.4 (FP-IT) 0.5 x 0.5 Degree Daily Averaged Data
#> Dates (month/day/year): 01/01/2015 through 01/10/2015
#> Location: Latitude -14.2427 Longitude -56.547
#> Elevation from MERRA-2: Average for 1/2x1/2 degree lat/lon region = 379.25 meters Site = na
#> Climate zone: na (reference Briggs et al: http://www.energycodes.gov)
#> Value for missing model data cannot be computed or out of model availability range: NA
#>
#> Parameters:
#> T2M_MAX MERRA2 1/2x1/2 Maximum Temperature at 2 Meters (C)
#>
#> # A tibble: 6 x 9
#> location LON LAT YEAR MM DD DOY YYYYMMDD T2M_MAX
#> <chr> <dbl> <dbl> <dbl> <int> <int> <int> <date> <dbl>
#> 1 loc1 -56.5 -14.2 2015 1 1 1 2015-01-01 29.9
#> 2 loc1 -56.5 -14.2 2015 1 2 2 2015-01-02 30.1
#> 3 loc1 -56.5 -14.2 2015 1 3 3 2015-01-03 27.3
#> 4 loc1 -56.5 -14.2 2015 1 4 4 2015-01-04 28.7
#> 5 loc1 -56.5 -14.2 2015 1 5 5 2015-01-05 30
#> 6 loc1 -56.5 -14.2 2015 1 6 6 2015-01-06 28.7
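It may also help to see why the original map() attempt failed. It passed a call, get_power(...), rather than a function, so R tried to evaluate c(long, lat) immediately, before any iteration, and found no object named long. The base-R sketch below reproduces this with lapply() and Map(); fake_get_power is a hypothetical stub used in place of the real get_power() so that nothing is downloaded.

```r
coords <- data.frame(location = c("loc1", "loc2", "loc3"),
                     long     = c(-56.547, -57.547, -58.547),
                     lat      = c(-14.2427, -15.2427, -16.2427))

# Hypothetical stand-in for get_power(): just echoes the coordinates it receives
fake_get_power <- function(lonlat) {
  data.frame(LON = lonlat[1], LAT = lonlat[2])
}

# Passing a call: it is evaluated up front, where 'long' does not exist
err <- tryCatch(
  lapply(coords, fake_get_power(lonlat = c(long, lat))),
  error = function(e) conditionMessage(e)
)

# Passing a function: Map() feeds it one location/long/lat triple per row
per_row <- Map(
  function(location, long, lat) {
    cbind(location = location, fake_get_power(c(long, lat)))
  },
  coords$location, coords$long, coords$lat
)
all_rows <- do.call(rbind, per_row)
```

Map() plays the role of pmap() here, and do.call(rbind, ...) plays the role of bind_rows().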

Resources