Translating a `for` loop into purrr using the `map` function in R

I need to download weather data from NASA’s POWER (Prediction Of Worldwide Energy Resource) project. The nasapower package was developed for retrieving these data in R. I need to download data for many locations (lat/long coordinates). To do this I tried a simple loop, with three locations as a reproducible example.
library(nasapower)
data1 <- read.csv(text = "
location,long,lat
loc1, -56.547, -14.2427
loc2, -57.547, -15.2427
loc3, -58.547, -16.2427")
all.weather <- data.frame()
for (i in seq_len(nrow(data1))) {
  weather.data <- get_power(community = "AG",
                            lonlat = c(data1$long[i], data1$lat[i]),
                            dates = c("2015-01-01", "2015-01-10"),
                            temporal_average = "DAILY",
                            pars = c("T2M_MAX"))
  all.weather <- rbind(all.weather, weather.data)
}
This works perfectly. The problem is that I am trying to mimic this using purrr::map, since I want an alternative within the tidyverse. This is what I did, but it does not work:
library(dplyr)
library(purrr)
all.weather <- data1 %>%
  group_by(location) %>%
  map(get_power(community = "AG",
                lonlat = c(long, lat),
                dates = c("2015-01-01", "2015-01-10"),
                temporal_average = "DAILY",
                site_elevation = NULL,
                pars = c("T2M_MAX")))
I got the following error:
Error in isFALSE(length(lonlat != 2)) : object 'long' not found
Any hint on how to run this using purrr?

To make your code work, use purrr::pmap instead of map, like so:
map is for one-argument functions, map2 is for two-argument functions, and pmap is the most general, handling functions with any number of arguments.
pmap loops over the rows of your data frame. As your data frame has 3 columns, 3 arguments are passed to the function, even though the first argument, location, is not used. To make this work, and to be able to refer to the columns by name, you have to supply an anonymous function with matching argument names, function(location, long, lat).
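A minimal illustration of how pmap() matches data frame columns to argument names (a toy function unrelated to nasapower, just to show the mechanics):
library(purrr)
d <- data.frame(location = c("a", "b"), long = c(-56.5, -57.5), lat = c(-14.2, -15.2))
pmap(d, function(location, long, lat) paste0(location, ": ", long, ", ", lat))
#> [[1]]
#> [1] "a: -56.5, -14.2"
#>
#> [[2]]
#> [1] "b: -57.5, -15.2"
Applied to your data, the full solution looks like this: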
library(nasapower)
data1 <- read.csv(text = "
location,long,lat
loc1, -56.547, -14.2427
loc2, -57.547, -15.2427
loc3, -58.547, -16.2427")
library(dplyr)
library(purrr)
all.weather <- data1 %>%
  pmap(function(location, long, lat)
    get_power(community = "AG",
              lonlat = c(long, lat),
              dates = c("2015-01-01", "2015-01-10"),
              temporal_average = "DAILY",
              site_elevation = NULL,
              pars = c("T2M_MAX"))) %>%
  # Name the list with the locations
  setNames(data1$location) %>%
  # Add the location names as identifiers
  bind_rows(.id = "location")
head(all.weather)
#> NASA/POWER SRB/FLASHFlux/MERRA2/GEOS 5.12.4 (FP-IT) 0.5 x 0.5 Degree Daily Averaged Data
#> Dates (month/day/year): 01/01/2015 through 01/10/2015
#> Location: Latitude -14.2427 Longitude -56.547
#> Elevation from MERRA-2: Average for 1/2x1/2 degree lat/lon region = 379.25 meters Site = na
#> Climate zone: na (reference Briggs et al: http://www.energycodes.gov)
#> Value for missing model data cannot be computed or out of model availability range: NA
#>
#> Parameters:
#> T2M_MAX MERRA2 1/2x1/2 Maximum Temperature at 2 Meters (C)
#>
#> # A tibble: 6 x 9
#>   location   LON   LAT  YEAR    MM    DD   DOY YYYYMMDD   T2M_MAX
#>   <chr>    <dbl> <dbl> <dbl> <int> <int> <int> <date>       <dbl>
#> 1 loc1     -56.5 -14.2  2015     1     1     1 2015-01-01    29.9
#> 2 loc1     -56.5 -14.2  2015     1     2     2 2015-01-02    30.1
#> 3 loc1     -56.5 -14.2  2015     1     3     3 2015-01-03    27.3
#> 4 loc1     -56.5 -14.2  2015     1     4     4 2015-01-04    28.7
#> 5 loc1     -56.5 -14.2  2015     1     5     5 2015-01-05    30
#> 6 loc1     -56.5 -14.2  2015     1     6     6 2015-01-06    28.7

Related

How to read files in two separate lists in a function based on a condition in R

Okay, I hope I manage to sum up what I need to achieve. I am running experiments in which I obtain data from two different sources, with date_time being the unifying variable for matching. The data from the two sources have the same structure (in csv or txt). The distinction is in the filenames. I provide an example below:
list_of_files <- structure(
  list(
    solid_epoxy1_10 = data.frame(
      date_time = c("20/07/2022 13:46", "20/07/2022 13:56", "20/07/2022 14:06"),
      frequency = c("30000", "31000", "32000"),
      index = c("1", "2", "3")
    ),
    solid_otherpaint_20 = data.frame(
      date_time = c("20/07/2022 13:10", "20/07/2022 13:20", "20/07/2022 14:30"),
      frequency = c("20000", "21000", "22000"),
      index = c("1", "2", "3")
    ),
    water_epoxy1_10 = data.frame(
      date_time = c("20/07/2022 13:46", "20/07/2022 13:56", "20/07/2022 14:06"),
      temperature = c("22.3", "22.6", "22.5")
    ),
    water_otherpaint_20 = data.frame(
      date_time = c("20/07/2022 13:10", "20/07/2022 13:20", "20/07/2022 14:30"),
      temperature = c("24.5", "24.6", "24.8")
    )
  )
)
First I want to read the data files into two separate lists: one for files whose names contain the keyword "solid", the other for files containing "water".
Then I need to create new columns in each data frame from the filename, which is separated by "_" (e.g. paint = "epoxy1", thickness = "10"), so that I can do an inner join by the date_time column plus paint, thickness, etc. Basically, what I struggle with so far is creating a function that loads the files into two separate lists. This is what I've tried so far:
library(data.table) # for fread()

load_files <- function(list_of_files) {
  all.files.board <- list()
  all.files.temp <- list()
  for (i in 1:length(list_of_files)) {
    if (exists("board")) {
      all.files.board[[i]] <- fread(list_of_files[i])
    } else {
      all.files.temp[[i]] <- fread(list_of_files[i])
    }
    return(list(all.files.board, all.files.temp)) # note: returns inside the loop
  }
}
But it doesn't do what I need. I hope I made it as clear as possible. I'm pretty comfortable with the tidyverse but still a newbie at writing custom functions. Any ideas welcome.
Regarding the question in the title:
The first issue, calling return() too early and thus breaking the for loop, was already mentioned in the comments, so that should be sorted.
The next one is the condition itself. if (exists("board")) checks whether an object called board exists anywhere on the search path. In the provided sample it evaluates to TRUE only if something was assigned to a global board object before load_files() was called, and to FALSE only if there was no such assignment or board was explicitly removed. I.e. with
board <- "something"; dataframes <- load_files(file_list)
that check will be TRUE, while with
rm(board); dataframes <- load_files(file_list)
it will be FALSE. Nothing in the function itself changes the "existence" of board, so the result is actually determined before the function is even called.
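A quick sketch of that pitfall (check_board is a made-up helper; run it in a fresh session):
check_board <- function() exists("board")
check_board()        # FALSE: no 'board' anywhere on the search path
board <- "something"
check_board()        # TRUE: the function finds the global 'board'
rm(board)
check_board()        # FALSE again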
If the core of the question is about joining two somewhat different datasets and splitting the result by groups, I'd just drop the loops, conditions and most of the intermediate lists and go with something like this in the tidyverse:
library(fs)
library(readr)
library(stringr)
library(dplyr)
library(tidyr)
# prepare input files for sample ------------------------------------------
sample_dfs <- structure(
  list(
solid_epoxy1_10 = data.frame(
date_time = c("20/07/2022 13:46", "20/07/2022 13:56", "20/07/2022 14:06"),
frequency = c("30000", "31000", "32000"),
index = c("1", "2", "3")
),
solid_otherpaint_20 = data.frame(
date_time = c("20/07/2022 13:10", "20/07/2022 13:20", "20/07/2022 14:30"),
frequency = c("20000", "21000", "22000"),
index = c("1", "2", "3")
),
water_epoxy1_10 = data.frame(
date_time = c("20/07/2022 13:46", "20/07/2022 13:56", "20/07/2022 14:06"),
temperature = c("22.3", "22.6", "22.5")
),
water_otherpaint_20 = data.frame(
date_time = c("20/07/2022 13:10", "20/07/2022 13:20", "20/07/2022 14:30"),
temperature = c("24.5", "24.6", "24.8")
)
)
)
tmp_path <- file_temp("reprex")
dir_create(tmp_path)
sample_filenames <- str_glue("{1:length(sample_dfs)}_{names(sample_dfs)}.csv")
for (i in seq_along(sample_dfs)) {
  write_csv(sample_dfs[[i]], path(tmp_path, sample_filenames[i]))
}
dir_ls(tmp_path, type = "file")
#> Temp/RtmpqUoct8/reprex5cc517f177b/1_solid_epoxy1_10.csv
#> Temp/RtmpqUoct8/reprex5cc517f177b/2_solid_otherpaint_20.csv
#> Temp/RtmpqUoct8/reprex5cc517f177b/3_water_epoxy1_10.csv
#> Temp/RtmpqUoct8/reprex5cc517f177b/4_water_otherpaint_20.csv
# read files --------------------------------------------------------------
t_solid <- dir_ls(tmp_path, glob = "*solid*.csv", type = "file") %>%
  read_csv(id = "filename") %>%
  extract(filename, c("paint", "thickness"), "_([^_]+)_(\\d+)\\.csv")
t_solid
#> # A tibble: 6 × 5
#>   paint      thickness date_time        frequency index
#>   <chr>      <chr>     <chr>                <dbl> <dbl>
#> 1 epoxy1     10        20/07/2022 13:46     30000     1
#> 2 epoxy1     10        20/07/2022 13:56     31000     2
#> 3 epoxy1     10        20/07/2022 14:06     32000     3
#> 4 otherpaint 20        20/07/2022 13:10     20000     1
#> 5 otherpaint 20        20/07/2022 13:20     21000     2
#> 6 otherpaint 20        20/07/2022 14:30     22000     3
t_water <- dir_ls(tmp_path, glob = "*water*.csv", type = "file") %>%
  read_csv(id = "filename") %>%
  extract(filename, c("paint", "thickness"), "_([^_]+)_(\\d+)\\.csv")
t_water
#> # A tibble: 6 × 4
#>   paint      thickness date_time        temperature
#>   <chr>      <chr>     <chr>                  <dbl>
#> 1 epoxy1     10        20/07/2022 13:46        22.3
#> 2 epoxy1     10        20/07/2022 13:56        22.6
#> 3 epoxy1     10        20/07/2022 14:06        22.5
#> 4 otherpaint 20        20/07/2022 13:10        24.5
#> 5 otherpaint 20        20/07/2022 13:20        24.6
#> 6 otherpaint 20        20/07/2022 14:30        24.8
# or implement as a function ----------------------------------------------
load_files <- function(csv_path, glob = "*.csv") {
  return(
    dir_ls(csv_path, glob = glob, type = "file") %>%
      # store filenames in a filename column
      read_csv(id = "filename", show_col_types = FALSE) %>%
      # extract each regex group to its own column
      extract(filename, c("paint", "thickness"), "_([^_]+)_(\\d+)\\.csv")
  )
}
# join / group / split ----------------------------------------------------
t_solid <- load_files(tmp_path, "*solid*.csv")
t_water <- load_files(tmp_path, "*water*.csv")
# either join by multiple columns or select only required cols
# to avoid x.* & y.* columns in result
inner_join(t_solid, t_water, by = c("date_time", "paint", "thickness")) %>%
  group_by(paint) %>%
  group_split()
Final result as a list of tibbles:
#> <list_of<
#> tbl_df<
#> paint : character
#> thickness : character
#> date_time : character
#> frequency : double
#> index : double
#> temperature: double
#> >
#> >[2]>
#> [[1]]
#> # A tibble: 3 × 6
#>   paint  thickness date_time        frequency index temperature
#>   <chr>  <chr>     <chr>                <dbl> <dbl>       <dbl>
#> 1 epoxy1 10        20/07/2022 13:46     30000     1        22.3
#> 2 epoxy1 10        20/07/2022 13:56     31000     2        22.6
#> 3 epoxy1 10        20/07/2022 14:06     32000     3        22.5
#>
#> [[2]]
#> # A tibble: 3 × 6
#>   paint      thickness date_time        frequency index temperature
#>   <chr>      <chr>     <chr>                <dbl> <dbl>       <dbl>
#> 1 otherpaint 20        20/07/2022 13:10     20000     1        24.5
#> 2 otherpaint 20        20/07/2022 13:20     21000     2        24.6
#> 3 otherpaint 20        20/07/2022 14:30     22000     3        24.8

Calculating the distance from all locations to the first location, by group

I have GPS locations from several seabird tracks, each starting from colony x. Therefore the individual tracks all have similar first locations. For each track, I would like to calculate the beeline distance between each GPS location and either (a) a specified location that represents the location of colony x, or (b) the first GPS point of a given track which represents the location of colony x. For (b), I would look to use the first location of each new track ID (track_id).
I have looked for appropriate functions in geosphere, sp, raster, adehabitatLT, move, ... and just cannot seem to find what I am looking for.
I can calculate the distance between successive GPS points, but that is not what I need.
library(dplyr)
library(geosphere) # for distVincentyEllipsoid() / distHaversine()
df %>%
  group_by(ID) %>%
  mutate(lat_prev = lag(Lat, 1), lon_prev = lag(Lon, 1)) %>%
  mutate(dist = distVincentyEllipsoid(matrix(c(lon_prev, lat_prev), ncol = 2), # or use distHaversine
                                      matrix(c(Lon, Lat), ncol = 2)))
#example data:
df <- data.frame(Lon = c(-96.8, -96.60861, -96.86875, -96.14351, -92.82518, -90.86053, -90.14208, -84.64081, -83.7, -82, -80, -88.52732, -94.46049,-94.30, -88.60, -80.50, -81.70, -83.90, -84.60, -90.10, -90.80, -92.70, -96.10, -96.55, -96.50, -96.00),
Lat = c(25.38657, 25.90644, 26.57339, 27.63348, 29.03572, 28.16380, 28.21235, 26.71302, 25.12554, 24.50031, 24.89052, 30.16034, 29.34550, 29.34550, 30.16034, 24.89052, 24.50031, 25.12554, 26.71302, 28.21235, 28.16380, 29.03572, 27.63348, 26.57339, 25.80000, 25.30000),
ID = c(rep("ID1", 13), rep("ID2", 13)))
Grateful for any pointers.
You were pretty close. The key is that you want to calculate the distance from the first observation in each track. Therefore you first need to add the observation order within each track (easy to do with dplyr::row_number()). Then, for the distance calculation, make the reference point always the first observation by subsetting with order == 1.
library(tidyverse)
library(geosphere)
df <- data.frame(Lon = c(-96.8, -96.60861, -96.86875, -96.14351, -92.82518, -90.86053, -90.14208, -84.64081, -83.7, -82, -80, -88.52732, -94.46049,-94.30, -88.60, -80.50, -81.70, -83.90, -84.60, -90.10, -90.80, -92.70, -96.10, -96.55, -96.50, -96.00),
Lat = c(25.38657, 25.90644, 26.57339, 27.63348, 29.03572, 28.16380, 28.21235, 26.71302, 25.12554, 24.50031, 24.89052, 30.16034, 29.34550, 29.34550, 30.16034, 24.89052, 24.50031, 25.12554, 26.71302, 28.21235, 28.16380, 29.03572, 27.63348, 26.57339, 25.80000, 25.30000),
ID = c(rep("ID1", 13), rep("ID2", 13)))
df %>%
  group_by(ID) %>%
  mutate(order = row_number()) %>%
  mutate(dist = distVincentyEllipsoid(matrix(c(Lon[order == 1], Lat[order == 1]), ncol = 2),
                                      matrix(c(Lon, Lat), ncol = 2)))
#> # A tibble: 26 x 5
#> # Groups:   ID [2]
#>      Lon   Lat ID    order     dist
#>    <dbl> <dbl> <chr> <int>    <dbl>
#>  1 -96.8  25.4 ID1       1       0
#>  2 -96.6  25.9 ID1       2   60714.
#>  3 -96.9  26.6 ID1       3  131665.
#>  4 -96.1  27.6 ID1       4  257404.
#>  5 -92.8  29.0 ID1       5  564320.
#>  6 -90.9  28.2 ID1       6  665898.
#>  7 -90.1  28.2 ID1       7  732131.
#>  8 -84.6  26.7 ID1       8 1225193.
#>  9 -83.7  25.1 ID1       9 1319482.
#> 10 -82    24.5 ID1      10 1497199.
#> # ... with 16 more rows
Created on 2022-01-09 by the reprex package (v2.0.1)
This also seems to work (sent to me by a friend). It is very similar to Dan's suggestion above, but slightly different:
library(geosphere)
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(Dist_to_col = distHaversine(c(Lon[1], Lat[1]), cbind(Lon, Lat)))
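For case (a), a fixed colony location, the same pattern works with a user-specified reference point instead of the first row of each track (a sketch; these colony coordinates are made up, substitute the real ones):
library(geosphere)
library(dplyr)
colony <- c(-96.8, 25.38657) # c(Lon, Lat) of colony x
df %>%
  group_by(ID) %>%
  mutate(dist_to_colony = distHaversine(colony, cbind(Lon, Lat)))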

How can I delineate multiple watersheds in R using the streamstats package?

There is an R package in development called streamstats that I would like to use. It delineates a watershed (within the USA) for a latitude and longitude point along a body of water and provides watershed characteristics such as drainage area and the proportions of various land covers. What I would like to do is extract some watershed characteristics of interest for a data frame of several lat/long positions.
I can get the package to do what I want for one point
devtools::install_github("markwh/streamstats")
library(streamstats)
setTimeout(120)
dat1 <- data.frame(matrix(ncol = 3, nrow = 3))
x <- c("state","lat","long")
colnames(dat1) <- x
dat1$state <- c("NJ","NY","VA")
dat1$lat <- c(40.99194,42.02458,38.04235)
dat1$long <- c(-74.28000,-75.11928,-79.88144)
test_dat <- dat1[1,]
ws1 <- delineateWatershed(xlocation = test_dat$long, ylocation = test_dat$lat, crs = 4326,
                          includeparameters = "true", includeflowtypes = "true")
chars1 <- computeChars(workspaceID = ws1$workspaceID, rcode = "MA")
chars1$parameters
However, what I would like is to be able to give the delineateWatershed function several locations at once (i.e., all 3 locations found in dat1) and combine the chars1$parameters output variables DRNAREA, FOREST, LC11DEV, and LC11IMP into a data frame. Maybe this could be achieved with a for loop?
The ideal output would look like this
state lat long DRNAREA FOREST LC11DEV LC11IMP
1 NJ 40.99194 -74.28000 160 66.2 26.20 5.50
2 NY 42.02458 -75.11928 457 89.3 2.52 0.18
3 VA 38.04235 -79.88144 158 NA 4.63 0.20
I would put what you have in a function, then use purrr::pmap_df() to loop through each row in dat1 and bind all the results together. See also this related answer.
library(dplyr)
library(purrr)
library(tidyr)
library(streamstats)
setTimeout(120)
dat1 <- data.frame(matrix(ncol = 3, nrow = 2))
colnames(dat1) <- c("state", "lat", "long")
dat1$state <- c("NJ", "NY")
dat1$lat <- c(40.99194, 42.02458)
dat1$long <- c(-74.28000, -75.11928)
dat1
#> state lat long
#> 1 NJ 40.99194 -74.28000
#> 2 NY 42.02458 -75.11928
Define a function for catchment delineation
catchment_delineation <- function(rcode_in, lat_y, long_x) {
  print(paste0("Processing for lat = ", lat_y, " and long = ", long_x))
  ws <- delineateWatershed(xlocation = long_x, ylocation = lat_y, crs = 4326,
                           includeparameters = "true", includeflowtypes = "true")
  ws_properties <- computeChars(workspaceID = ws$workspaceID, rcode = rcode_in)
  # keep only what we need
  ws_properties_df <- ws_properties$parameters %>%
    filter(code %in% c("DRNAREA", "FOREST", "LC11DEV", "LC11IMP")) %>%
    mutate(ID = ws$workspaceID,
           state = rcode_in,
           long = long_x,
           lat = lat_y)
  return(ws_properties_df)
}
Apply the function to each row of the dat1 data frame:
catchment_df <- pmap_df(dat1, ~ catchment_delineation(..1, ..2, ..3))
#> https://streamstats.usgs.gov/streamstatsservices/watershed.geojson?rcode=NJ&xlocation=-74.28&ylocation=40.99194&includeparameters=true&includeflowtypes=true&includefeatures=true&crs=4326https://streamstats.usgs.gov/streamstatsservices/parameters.json?rcode=NJ&workspaceID=NJ20210923064141811000&includeparameters=truehttps://streamstats.usgs.gov/streamstatsservices/watershed.geojson?rcode=NY&xlocation=-75.11928&ylocation=42.02458&includeparameters=true&includeflowtypes=true&includefeatures=true&crs=4326https://streamstats.usgs.gov/streamstatsservices/parameters.json?rcode=NY&workspaceID=NY20210923064248530000&includeparameters=true
catchment_df
#> ID name
#> 1 NJ20210923064141811000 Drainage Area
#> 2 NJ20210923064141811000 Percent Forest
#> 3 NJ20210923064141811000 Percent Developed from NLCD2011
#> 4 NJ20210923064141811000 Percent_Impervious_NLCD2011
#> 5 NY20210923064248530000 Drainage Area
#> 6 NY20210923064248530000 Percent Forest
#> 7 NY20210923064248530000 Percent Developed from NLCD2011
#> 8 NY20210923064248530000 Percent_Impervious_NLCD2011
#> description
#> 1 Area that drains to a point on a stream
#> 2 Percentage of area covered by forest
#> 3 Percentage of developed (urban) land from NLCD 2011 classes 21-24
#> 4 Average percentage of impervious area determined from NLCD 2011 impervious dataset
#> 5 Area that drains to a point on a stream
#> 6 Percentage of area covered by forest
#> 7 Percentage of developed (urban) land from NLCD 2011 classes 21-24
#> 8 Average percentage of impervious area determined from NLCD 2011 impervious dataset
#> code unit value state long lat
#> 1 DRNAREA square miles 160.00 NJ -74.28000 40.99194
#> 2 FOREST percent 66.20 NJ -74.28000 40.99194
#> 3 LC11DEV percent 26.20 NJ -74.28000 40.99194
#> 4 LC11IMP percent 5.50 NJ -74.28000 40.99194
#> 5 DRNAREA square miles 457.00 NY -75.11928 42.02458
#> 6 FOREST percent 89.30 NY -75.11928 42.02458
#> 7 LC11DEV percent 2.52 NY -75.11928 42.02458
#> 8 LC11IMP percent 0.18 NY -75.11928 42.02458
Reshape the result to the desired format:
catchment_reshape <- catchment_df %>%
  select(state, long, lat, code, value) %>%
  pivot_wider(names_from = code,
              values_from = value)
catchment_reshape
#> # A tibble: 2 x 7
#>   state  long   lat DRNAREA FOREST LC11DEV LC11IMP
#>   <chr> <dbl> <dbl>   <dbl>  <dbl>   <dbl>   <dbl>
#> 1 NJ    -74.3  41.0     160   66.2   26.2     5.5
#> 2 NY    -75.1  42.0     457   89.3    2.52    0.18
Created on 2021-09-22 by the reprex package (v2.0.1)
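One caveat: ~ catchment_delineation(..1, ..2, ..3) matches the columns of dat1 purely by position (state, lat, long). If the column order might change, matching by name is safer; a sketch of the same call:
catchment_df <- pmap_df(dat1, function(state, lat, long) {
  catchment_delineation(rcode_in = state, lat_y = lat, long_x = long)
})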
Since you mentioned the use of a for loop, I thought why not make a solution out of it.
Here is your data:
library(dplyr)
library(purrr)
library(tidyr)
library(streamstats)
setTimeout(120)
dat1 <- data.frame(matrix(ncol = 3, nrow = 2))
colnames(dat1) <- c("state", "lat", "long")
dat1$state <- c("NJ", "NY")
dat1$lat <- c(40.99194, 42.02458)
dat1$long <- c(-74.28000, -75.11928)
dat1
Create an empty list to store the watershed characteristics:
water_shed <- list()
Loop through the dat1 and return the properties for each respective longitude and latitude:
for (i in 1:nrow(dat1)) {
  water_shed[[i]] <-
    delineateWatershed(xlocation = dat1$long[i], ylocation = dat1$lat[i], crs = 4326,
                       includeparameters = "true", includeflowtypes = "true")
}
Now create a list to store the watershed properties:
ws_properties <- list()
Loop through the water_shed returning the parameters of each location:
for (i in 1:length(water_shed)) {
  ws_properties[[i]] <- computeChars(workspaceID = water_shed[[i]]$workspaceID, rcode = dat1$state[i])
}
Finally, create a data frame for the desired outputs, then append the properties for each location by looping through the list of watershed properties:
# data frame:
ws_properties_df <- data.frame(state = character(), long = numeric(), lat = numeric(),
                               DRNAREA = numeric(), FOREST = numeric(), LC11DEV = numeric(),
                               LC11IMP = numeric(), stringsAsFactors = FALSE)
# append properties for each location
for (i in 1:length(ws_properties)) {
  ws_properties_df[i, ] <- ws_properties[[i]]$parameters %>%
    filter(code %in% c("DRNAREA", "FOREST", "LC11DEV", "LC11IMP")) %>%
    mutate(state = dat1$state[i],
           long = dat1$long[i],
           lat = dat1$lat[i]) %>%
    select(state, long, lat, code, value) %>%
    pivot_wider(names_from = code,
                values_from = value)
}
Desired output: ws_properties_df, one row per location with the columns state, long, lat, DRNAREA, FOREST, LC11DEV and LC11IMP.

How to create a rate in R

I want to change my data so that it gives me the rate of pedestrians relative to each state's population. I am using a linear model and my summary values look like this:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.087061 0.029876 2.914 0.00438 **
intersection 0.009192 0.003086 2.978 0.00362 **
Here, my beta estimate for intersection is 0.009192, and on its own that is not meaningful: the same count could be negligible in a high-population state and substantial in a low-population one.
Below is a condensed version of my data without all the columns I use, but here is the link to the csv in case someone wants to download it from there.
> head(c)
# A tibble: 6 x 15
  STATE STATENAME  PEDS PERSONS PERMVIT PERNOTMVIT COUNTY COUNTYNAME     CITY   DAY MONTH  YEAR LATITUDE LONGITUD
  <dbl> <chr>     <dbl>   <dbl>   <dbl>      <dbl>  <dbl> <chr>         <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl>
1     1 Alabama       0       3       3          0     81 LEE (81)       2340     7     2  2019     32.7    -85.3
2     1 Alabama       0       2       2          0     55 ETOWAH (55)    1280    23     1  2019     34.0    -86.1
3     1 Alabama       0       4       4          0     29 CLEBURNE (29)     0    22     1  2019     33.7    -85.4
4     1 Alabama       1       1       1          1     55 ETOWAH (55)    2562    22     1  2019     34.0    -86.1
5     1 Alabama       0       1       1          0      3 BALDWIN (3)       0    18     1  2019     30.7    -87.8
6     1 Alabama       0       2       2          0     85 LOWNDES (85)      0     7     1  2019     32.2    -86.4
# … with 1 more variable: FATALS <dbl>
Here is the code that runs through the process I am doing. I don't see how I can change it so that each value is a rate (values like peds or type_int).
#Libraries
rm(list=ls()) # this is to clear anything in memory
library(leaflet)
library(tidyverse)
library(ggmap)
library(leaflet.extras)
library(htmltools)
library(ggplot2)
library(maps)
library(mapproj)
library(mapdata)
library(zoo)
library(tsibble)
setwd("~/Desktop/Statistics790/DataSets/FARS2019NationalCSV")
df <- read.csv("accident.csv")
state <- unique(df$STATE)
for (i in state) {
  df1 <- df %>%
    filter(STATE == i) %>%
    dplyr::select(c(STATE, PEDS, DAY, MONTH, YEAR, TYP_INT)) %>%
    mutate(date = as.Date(as.character(paste(YEAR, MONTH, DAY, sep = "-"), "%Y-%m-%d"))) %>% # create a date
    group_by(date) %>% # group by date
    # summarise_at(.vars = vars(PEDS), sum)
    summarise(pedday = sum(PEDS), intersection = mean(TYP_INT))
  # ts1 <- ts(df1, start = c(2019, 1, 1), frequency = 365)
  setwd("~/Desktop/Statistics790/States_ts/figures")
  plots <- df1 %>%
    ggplot() +
    geom_line(aes(x = date, y = pedday)) + ylim(0, 13) +
    theme_bw()
  ggsave(paste0("state_", i, ".png"), width = 8, height = 6)
  ts1 <- ts(df1, start = c(2019, 1, 1), frequency = 365)
  setwd("~/Desktop/Statistics790/States_ts")
  ts1 %>% write.csv(paste0("state_", i, ".csv"), row.names = F)
}
#date1<- as.character(df$date)
#df1<- df%>% filter(STATE=="1")
#ts2<-xts(df,order.by = as.Date(df$date,"%Y-%m-%d"))
setwd("~/Desktop/Statistics790/States_ts")
cat("\f")
#df <- read.csv(paste0("state_1.csv"))
#print("------Linear Model------")
#summary(lm(pedday~weather,data=df))
for (i in state) {
  print(paste0("-------------------------Analysis for State: ", i, " -------------------------------"))
  df <- read.csv(paste0("state_", i, ".csv"))
  print("------Linear Model------")
  print(summary(lm(pedday ~ intersection, data = df)))
}
Collating my answers from the comments: you need to get state population data from an outside source, such as the US Census (https://www.census.gov/data/tables/time-series/demo/popest/2010s-state-total.html#par_textimage_1574439295), read it in, join it to your dataset, and then calculate the rate as pedestrians per population, scaled for ease of reading on the graph. You can also make your code faster by taking some of your calculations out of the loop. The code below assumes the census data is called 'census.csv' and has columns 'Geographic Area' for state and 'X2019' for the most recent population data available.
pop <- read.csv('census.csv')
df <- read.csv('accident.csv') %>%
  left_join(pop, by = c('STATENAME' = 'Geographic Area')) %>%
  mutate(rate = (PEDS / X2019) * <scale>) %>% # <scale> is your chosen multiplier, e.g. 100000 for a per-100,000 rate
  mutate(date = as.Date(as.character(paste(YEAR, MONTH, DAY, sep = "-"), "%Y-%m-%d")))
The left_join will match state names and give each row a population value depending on its state, regardless of how many rows there are.
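A toy illustration of that join behaviour (made-up population numbers, not census figures):
library(dplyr)
pop <- data.frame(`Geographic Area` = c("Alabama", "Alaska"),
                  X2019 = c(4900000, 730000),
                  check.names = FALSE)
acc <- data.frame(STATENAME = c("Alabama", "Alabama", "Alaska"),
                  PEDS = c(0, 1, 2))
left_join(acc, pop, by = c("STATENAME" = "Geographic Area")) %>%
  mutate(rate = (PEDS / X2019) * 100000) # per 100,000 residents
Both Alabama rows receive the same X2019 value, so repeated rows per state are handled automatically.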

Loop to run a model on subsets of a dataframe

I am not very experienced with loops so I am not sure where I went wrong here...
I have a dataframe that looks like:
month year day mean.temp mean.temp.year.month
1 1961 1 4.85 4.090323
1 1961 2 4.90 4.090323
1 1961 3 2.95 4.090323
1 1961 4 3.40 4.090323
1 1961 5 2.90 4.090323
dataset showing 3 months for 2 years can be found here:
https://drive.google.com/file/d/1w7NVeoEh8b7cAkU3cu1sXx6yCh75Inqg/view?usp=sharing
and I want to subset this dataframe by year and month so that I can run one nls model per year and month. Since my dataset contains 56 years (each with 12 months), that gives 672 models. Then I want to store the parameter estimates in a separate table.
I've created this code, but I can't work out why it is only giving me the parameter estimates for month 12 (all 56 years, but just month 12):
table <- matrix(99999, nrow = 672, ncol = 4)
YEARMONTHsel <- unique(df_weather[c("year", "month")])
YEARsel <- unique(df_weather$year)
MONTHsel <- unique(df_weather$month)
for (i in 1:length(YEARsel)) {
  for (j in 1:length(MONTHsel)) {
    temp2 <- df_weather[df_weather$year == YEARsel[i] & df_weather$month == MONTHsel[j], ]
    mn <- nls(mean.temp ~ mean.temp.year.month + alpha * sin(day * pi * 2 / 30 + phi),
              data = temp2, control = nlc,
              start = list(alpha = -6.07043, phi = -10))
    cr <- as.vector(coef(mn))
    nv <- length(coef(mn))
    table[i, 1:nv] <- cr
    table[i, nv + 1] <- YEARsel[i]
    table[i, nv + 2] <- MONTHsel[j]
  }
}
I've tried several options (i.e. without using a nested loop) but I'm not getting anywhere.
Any help would be greatly appreciated! Thanks.
Based on your loop, it looks like you want to run the regression grouped by year and month and then extract the coefficients into a new dataframe (correct me if that's wrong). Incidentally, the reason you only see month 12 is that you index the results with table[i, ...]: each pass of the inner j loop overwrites row i, so only the last month survives. You would need a combined row index such as table[(i - 1) * length(MONTHsel) + j, ]. A tidyverse approach avoids that bookkeeping entirely:
library(readxl)
library(tidyverse)
df <- read_excel("~/Downloads/df_weather.xlsx")
df %>%
  nest(-month, -year) %>%
  mutate(model = map(data, ~ nls(mean.temp ~ mean.temp.year.month + alpha * sin(day * pi * 2 / 30 + phi),
                                 data = .x, control = nlc,
                                 start = list(alpha = -6.07043, phi = -10))),
         coeff = map(model, ~ coefficients(.x))) %>%
  unnest(coeff %>% map(broom::tidy)) %>%
  spread(names, x) %>%
  arrange(year)
#> # A tibble: 6 x 4
#>   month  year  alpha    phi
#>   <dbl> <dbl>  <dbl>  <dbl>
#> 1     1  1961  0.561 -10.8
#> 2     2  1961 -1.50  -10.5
#> 3     3  1961 -2.06   -9.77
#> 4     1  1962 -3.35   -5.48
#> 5     2  1962 -2.27   -9.97
#> 6     3  1962  0.959 -10.8
First we nest the data based on your groups (in this case year and month), then we map the model over each group, then map out the coefficients for each group, and lastly we unnest the coefficients and spread the data from long to wide. A sketch with current tidyr idioms follows.
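Note that nest(-month, -year), spread(), and unnesting a list built inline are superseded idioms in current tidyr. A sketch of the same pipeline in present-day style (same assumptions as above, including your nlc control object):
library(tidyverse)
library(broom)
df %>%
  nest(data = -c(year, month)) %>%
  mutate(model = map(data, ~ nls(mean.temp ~ mean.temp.year.month + alpha * sin(day * pi * 2 / 30 + phi),
                                 data = .x, control = nlc,
                                 start = list(alpha = -6.07043, phi = -10))),
         coeffs = map(model, tidy)) %>%
  unnest(coeffs) %>%
  select(year, month, term, estimate) %>%
  pivot_wider(names_from = term, values_from = estimate) %>%
  arrange(year, month)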
