Convert Tables to Data Frames with Loop in R

I have over 100 tables, one per ID, each with the columns ID, Date, Days, Mass, Float, Date 2, Days 2, and pH.
I split the original data frame by ID and used the IDs as the names of the resulting tables, as shown below.
data <- NN
ID <- paste0("", NN$ID)  # coerce the IDs to character for use as names
SD <- split(NN, ID)      # one table per ID, named by the ID
SD
Each of the IDs looks as follows:
> SD$`4469912`
# A tibble: 5 × 8
       ID Date                 Days  Mass Float `Date 2` `Days 2` pH
    <dbl> <dttm>              <dbl> <dbl> <dbl> <dttm>      <dbl> <chr>
1 4469912 2022-05-24 00:00:00     0  440   16.9 NA              0 NA
2 4469912 2022-05-27 00:00:00     3  813   NA   NA              0 NA
3 4469912 2022-06-02 00:00:00     9  930   NA   NA              0 NA
4 4469912 2022-06-03 00:00:00    10  914.  NA   NA              0 NA
5 4469912 2022-06-06 00:00:00    13  944   NA   NA              0 NA
I would like to convert each ID to its own data frame, as shown below:
`4469912` <- data.frame(SD$`4469912`)
or, equivalently (the double bracket extracts the data frame itself):
`4469912` <- data.frame(SD[[9]])
The problem I am running into is writing a loop that creates each table as its own data frame, with each data frame named after its corresponding ID. I tried something along the lines of the code below.
for (x in SD) {
  names(SD[x]) <- data.frame(SD[x])
}
EDIT: I will add that the end goal is to pull or select specific IDs and plot them on top of or against one another in ggplot, with each ID as its own geom_line, for example:
`4469912` <- data.frame(SD$`4469912`)
`4469822` <- data.frame(SD$`4469822`)
`4469222` <- data.frame(SD$`4469222`)
ggplot(data = NULL, aes(x = Date, y = Mass)) +
  geom_line(data = `4469912`, col = "red") +
  geom_line(data = `4469822`, col = "blue") +
  geom_line(data = `4469222`, col = "green")
Rather than plotting my entire original data frame, I can then examine falloff or regression between selected IDs instead of across all data points, if that makes sense and/or is relevant.
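A minimal sketch of one approach, assuming SD and NN as defined above (my suggestion; the thread does not include an answer): assign() inside a loop over names(SD) creates one data frame per ID, and list2env() is a one-line alternative. For the plotting goal it is often simpler to skip the per-ID objects entirely, keep a single data frame, and map ID to colour.
# Create one data frame per ID in the global environment.
# list2env(SD, envir = .GlobalEnv) does the same in one call.
for (id in names(SD)) {
  assign(id, data.frame(SD[[id]]))
}

# Alternative for the plotting goal: filter one data frame and map
# ID to colour instead of creating hundreds of objects.
library(dplyr)
library(ggplot2)
NN %>%
  filter(ID %in% c(4469912, 4469822, 4469222)) %>%
  ggplot(aes(x = Date, y = Mass, colour = factor(ID))) +
  geom_line()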

Related

How can I calculate mean values for each day of an year from a time series data set in R?

I have a data set containing climatic data taken hourly from 01-01-2007 to 31-12-2021.
I want to calculate the mean value for a given variable (e.g. temperature) for each day of the year (1:365).
My dataset looks something like this:
       dia        prec_h  tc_h  um_h   v_d  vm_h
       <date>      <dbl> <dbl> <dbl> <dbl> <dbl>
1      2007-01-01    0.2  22.9    89    42   3
2      2007-01-01    0.4  22.8    93    47   1.9
3      2007-01-01    0    22.7    94    37   1.3
4      2007-01-01    0    22.6    94    38   1.6
5      2007-01-01    0    22.7    95    46   2.3
[...]
131496 2021-12-31    0.0  24.7    87    47   2.6
("[...]" stands for the sequence of data from 2007 to 2021.)
I first calculated daily mean temperature for each of my entry dates as follows:
md$dia <- as.Date(md$dia, format = "%d/%m/%Y")
m_tc <- aggregate(tc_h ~ dia, md, mean)
This returned a data frame with a daily mean temperature for every date across the analyzed years.
Now, I want to calculate the mean temperature for each day of the year from this data frame, i.e: mean temperature for January 1st up to December 31st.
Thus, I need to end up with a data frame with 365 rows, but I don't know how to do such a calculation. Can anyone help me out?
Also, there is a complication: I have 4 leap years in my data frame. Any recommendations on how to deal with them?
Thanks!
First simulate a data set with the relevant columns and number of rows, then aggregate by day, giving m_tc.
As for the question, create an auxiliary variable mdia by formatting the dates column as month-day only, then compute the means grouping by mdia. The result is a data.frame with 366 rows and 2 columns, as expected (366 rather than 365 because the leap years contribute a February 29).
set.seed(2022)
# number of rows in the question
n <- 131496L
dia <- seq(as.Date("2007-01-01"), as.Date("2021-12-31"), by = "1 day")
md <- data.frame(
  dia = sort(sample(dia, n, TRUE)),
  tc_h = round(runif(n, 0, 40), 1)
)
m_tc <- aggregate(tc_h ~ dia, md, mean)
mdia <- format(m_tc$dia, "%m-%d")
final <- aggregate(tc_h ~ mdia, m_tc, mean)
str(final)
#> 'data.frame': 366 obs. of 2 variables:
#> $ mdia: chr "01-01" "01-02" "01-03" "01-04" ...
#> $ tc_h: num 20.2 20.4 20.2 19.6 20.7 ...
head(final, n = 10L)
#> mdia tc_h
#> 1 01-01 20.20741
#> 2 01-02 20.44143
#> 3 01-03 20.20979
#> 4 01-04 19.63611
#> 5 01-05 20.69064
#> 6 01-06 18.89658
#> 7 01-07 20.15992
#> 8 01-08 19.53639
#> 9 01-09 19.52999
#> 10 01-10 19.71914
Created on 2022-10-18 with reprex v2.0.2
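On the leap-year complication: one simple option (my addition, not part of the answer above) is to drop February 29 from the result when exactly 365 rows are required.
final_365 <- final[final$mdia != "02-29", ]  # drop the leap day
nrow(final_365)
#> [1] 365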
You can also pass your data through the pipe (%>%) from the magrittr package and calculate the mean values with the dplyr package:
library(dplyr); library(magrittr)
tcmean <- md %>% group_by(dia) %>% summarise(m_tc = mean(tc_h))
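Note that this gives one mean per date only (the first step). A sketch extending the same dplyr approach to day-of-year means, mirroring the mdia idea from the answer above (my extension, not part of this answer):
library(dplyr)
md %>%
  group_by(dia) %>%
  summarise(m_tc = mean(tc_h)) %>%         # daily means first
  mutate(mdia = format(dia, "%m-%d")) %>%  # month-day, e.g. "01-01"
  group_by(mdia) %>%
  summarise(tc_h = mean(m_tc))             # then day-of-year means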

Adding points to geom_hline

I am trying to plot a time series with a geom_hline, using FL_Actions and the point data for that line. So far I've been able to add the geom_hline, but I am having problems adding the geom_point with the labels for FL_Actions, where 1 means closing-down policies and 2 means opening-up policies. I am using this dataset (see sample below):
# A tibble: 22 x 10
   Date    Date2   Date3   FLORIDA FLday MICHIGAN MIday FL_Actions MI_Actions realdate
   <chr>   <chr>   <chr>     <dbl> <dbl>    <dbl> <dbl>      <dbl>      <dbl> <date>
 1 3/6/20  3/6/20  3/6/20        3     0        0     0         NA         NA 2020-03-06
 2 3/7/20  3/7/20  3/7/20        7     4        0     0         NA         NA 2020-03-07
 3 3/8/20  3/8/20  3/8/20       10     3        0     0         NA         NA 2020-03-08
 4 3/9/20  3/9/20  3/9/20       13     3        0     0          1         NA 2020-03-09
 5 3/10/20 3/10/20 3/10/20      15     2        0     0         NA          1 2020-03-10
 6 3/11/20 3/11/20 3/11/20      24     9        2     2          1         NA 2020-03-11
 7 3/12/20 3/12/20 3/12/20      30     6        3     1         NA         NA 2020-03-12
 8 3/13/20 3/13/20 3/13/20      45    15       22    19         NA         NA 2020-03-13
 9 3/14/20 3/14/20 3/14/20      64    19       35    13         NA         NA 2020-03-14
10 3/15/20 3/15/20 3/15/20     100    36       45    10         NA         NA 2020-03-15
# … with 12 more rows
This is my current code:
ggplot(MI_FL_Data, aes(realdate, FLday)) +
  geom_line() +
  geom_hline(aes(yintercept = 15000), data = MI_FL_Data, linetype = 2) +
  geom_hline(aes(yintercept = 17000), data = MI_FL_Data, linetype = 4) +
  geom_point(aes(col=8, 15000)) +
  geom_point(aes(col=8, 17000)) +
  labs(x = NULL, y = "Number of Reported Daily COVID Cases", title = "State of Florida") +
  theme_classic()
However, I keep getting an error which says: Error: Invalid input: date_trans works with objects of class Date only.
I think I need to add realdate (I used library(lubridate) to create the realdate variable), but I want only the dates coded 1 to go on one geom_hline and those coded 2 on the other. Would this look something like:
geom_point(aes(realdate, col=8 if.1) yintercept 17000)
or something like that?
The line geom_point(aes(col=8, 15000)) generates an error; I am not sure what it is supposed to do. Also, I don't get why your yintercept is as high as 15000/17000 when the values on the y-axis are in the range 0-36 (at least for the data shown).
This works but I don't know if you had this in mind.
library(ggplot2)
ggplot(MI_FL_Data, aes(realdate, FLday)) +
  geom_line() +
  geom_hline(aes(yintercept = 15), linetype = 2) +
  geom_hline(aes(yintercept = 17), linetype = 4) +
  labs(x = NULL, y = "Number of Reported Daily COVID Cases",
       title = "State of Florida") +
  theme_classic()
It appears your date columns are actually formatted as characters. Can you change them to Date and see if that helps? You can mutate them and overwrite them with as.Date(Date, format = "%m/%d/%y"). As for the second point, you could supply the data argument and filter inside the individual geom calls, for example.
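A minimal sketch of both suggestions combined (my reading of the intent; it assumes dplyr is loaded and the column names match the sample):
library(dplyr)
library(ggplot2)

MI_FL_Data <- MI_FL_Data %>%
  mutate(realdate = as.Date(Date, format = "%m/%d/%y"))

ggplot(MI_FL_Data, aes(realdate, FLday)) +
  geom_line() +
  # points only on the dates coded 1 (closing down) or 2 (opening up)
  geom_point(data = filter(MI_FL_Data, FL_Actions == 1), colour = "red") +
  geom_point(data = filter(MI_FL_Data, FL_Actions == 2), colour = "blue")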

Calculate number of negative values between two dates

I have a data frame of SPEI values. I want to calculate two statistics (explained below) at an interval of 20 years (i.e. 2021-2040, 2041-2060, 2061-2080, 2081-2100) and for each year (i.e. 2021, 2022, 2023, etc., up to 2100). The first column contains the Date (month-year).
The statistics are:
Drought frequency: the number of times SPEI < 0 in the specified period (20 years and 1 year, respectively)
Drought duration: equal to the number of months between a drought's start month (included) and end month (not included) within the specified period. I am assuming a drought event starts when SPEI < 0.
I was wondering if there's a way to do that in R? It seems like an easy problem, but I don't know how to do it. Please help me out. Excel is taking too long. Thanks.
> head(test, 20)
Date spei-3
1 2021-01-01 NA
2 2021-02-01 NA
3 2021-03-01 -0.52133737
4 2021-04-01 -0.60047887
5 2021-05-01 0.56838399
6 2021-06-01 0.02285012
7 2021-07-01 0.26288462
8 2021-08-01 -0.14314685
9 2021-09-01 -0.73132256
10 2021-10-01 -1.23389220
11 2021-11-01 -1.15874943
12 2021-12-01 0.27954143
13 2022-01-01 1.14606657
14 2022-02-01 0.66872986
15 2022-03-01 -1.13758050
16 2022-04-01 -0.27861017
17 2022-05-01 0.99992395
18 2022-06-01 0.61024314
19 2022-07-01 -0.47450485
20 2022-08-01 -1.06682997
Edit:
I would very much like to add some code, but I don't know where to start.
library(readxl)
test <- read_excel("E:/drought.xlsx")
# Extract year and month and add them as columns
test$Year <- format(test$Date, "%Y")
test$Month <- format(test$Date, "%B")
I don't know how to go on from here. I found that cumsum can help, but how do I select one year and then apply cumsum to it? I am not withholding code on purpose; I just don't know where or how to begin.
There are a couple of questions in the OP's post, so I will go through them step by step. You'll need dplyr and lubridate for this workflow.
First, we create some fake data to use:
library(lubridate)
library(dplyr)
#create example data
dd <- data.frame(Date = seq.Date(as.Date("2021-01-01"), as.Date("2100-12-01"), by = "month"),
                 spei = rnorm(960, 0, 2))
Once the grouping columns created below are added, the data will look like this, similar to what you have above:
> head(dd)
Date spei year year_20 drought
1 2021-01-01 -6.85689789 2021 2021_2040 1
2 2021-02-01 -0.09292459 2021 2021_2040 1
3 2021-03-01 0.13715922 2021 2021_2040 0
4 2021-04-01 2.26805601 2021 2021_2040 0
5 2021-05-01 -0.47325008 2021 2021_2040 1
6 2021-06-01 0.37034138 2021 2021_2040 0
We can use lubridate and cut to create the yearly and 20-year variables to group by later, plus a column drought signifying whether spei was negative.
#create a column to group on by year and by 20-year
dd <- dd %>%
  mutate(year = year(Date),
         year_20 = cut(year, breaks = c(2020, 2040, 2060, 2080, 2100), include.lowest = T,
                       labels = c("2021_2040", "2041_2060", "2061_2080", "2081_2100"))) %>%
  # column signifying if that month was a drought
  mutate(drought = ifelse(spei < 0, 1, 0))
Once we have that, we just use the group_by function to get frequency (or number of months with a drought) by year or 20-year period
#by year
dd %>%
  group_by(year) %>%
  summarise(year_freq = sum(drought)) %>%
  ungroup()
# A tibble: 80 x 2
year year_freq
<dbl> <dbl>
1 2021 6
2 2022 4
3 2023 7
4 2024 6
5 2025 6
6 2026 7
#by 20-year group
dd %>%
  group_by(year_20) %>%
  summarise(year20_freq = sum(drought)) %>%
  ungroup()
# A tibble: 4 x 2
year_20 year20_freq
<fct> <dbl>
1 2021_2040 125
2 2041_2060 121
3 2061_2080 121
4 2081_2100 132
Calculating drought duration is a bit more complicated. It involves:
1. identifying the first month of each drought
2. calculating the length of each drought
3. combining the information from 1 and 2 together
We can use lag to identify when a month changed from "no drought" to "drought". In this case we want an index of where the value in row i is different from that in row i-1
# find index of where values change.
change.ind <- dd$drought != lag(dd$drought)
#use index to find drought start
drought.start <- dd[change.ind & dd$drought == 1,]
This results in a subset of the initial dataset containing only the rows for the first month of each drought. Then we can use rle to calculate the length of each drought. rle calculates the length of every run of values, so we subset to only those runs where the value == 1 (drought).
#calculate drought lengths
drought.lengths <- rle(dd$drought)
# we only want droughts (values = 1)
drought.lengths <- drought.lengths$lengths[drought.lengths$values==1]
Now we can combine these two pieces of information together. The first row is an NA because there is no value at i-1 to compare the lag to. It can be dropped, unless you want to include that data.
drought.dur <- cbind(drought.start, drought_length = drought.lengths)
head(drought.dur)
Date spei year year_20 drought drought_length
NA <NA> NA NA <NA> NA 2
5 2021-05-01 -0.47325008 2021 2021_2040 1 1
9 2021-09-01 -2.04564549 2021 2021_2040 1 1
11 2021-11-01 -1.04293866 2021 2021_2040 1 2
14 2022-02-01 -0.83759671 2022 2021_2040 1 1
17 2022-05-01 -0.07784316 2022 2021_2040 1 1
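As a possible final step (my addition, not in the original answer), the combined data can be summarised by 20-year period, e.g. mean and maximum drought length per group:
library(dplyr)
drought.dur %>%
  filter(!is.na(Date)) %>%  # drop the leading NA row
  group_by(year_20) %>%
  summarise(mean_length = mean(drought_length),
            max_length  = max(drought_length))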

Iteration for time series data, using purrr

I have a bunch of time series stacked on top of one another in a data frame, one series for each region in a country. I'd like to apply the seas() function (from the seasonal package) to each series iteratively, to make the series seasonally adjusted. To do this, I first have to convert each series to a ts class. I'm struggling to do all this using purrr.
Here's a minimum worked example:
library(seasonal)
library(tidyverse)
set.seed(1234)
df <- data.frame(region = rep(1:10, each = 20),
                 quarter = rep(1:20, 10),
                 var = sample(5:200, 200, replace = T))
For each region (indexed by a number) I'd like to perform the following operations. Here's the first region as an example:
tem1 <- df %>% filter(region == 1)                      # one region's series
tem2 <- ts(data = tem1$var, frequency = 4, start = c(1990, 1))
tem3 <- seas(tem2)                                      # seasonal adjustment
tem4 <- as.data.frame(tem3$data)
I'd then like to stack the output (i.e. the multiple tem4 data frames, one for each region), along with the region and quarter identifiers.
So, the start of the output for region 1 would be this:
final seasonaladj trend irregular region quarter
1 27 27 96.95 -67.97279 1 1
2 126 126 96.95 27.87381 1 2
3 124 124 96.95 27.10823 1 3
4 127 127 96.95 30.55075 1 4
5 173 173 96.95 75.01355 1 5
6 130 130 96.95 32.10672 1 6
The data for region 2 would be below this etc.
I started with the following but without luck so far. Basically, I'm struggling to get the time series into the tibble:
seas.adjusted <- df %>%
  group_by(region) %>%
  mutate(data.ts = map(.x = data$var,
                       .f = as.ts,
                       start = 1990,
                       freq = 4))
I don't know much about the seasonal adjustment part, so there may be things I missed, but I can help with moving your calculations into a map-friendly function.
After grouping by region, you can nest the data so there's a nested data frame for each region. Then you can run essentially the same code as you had, but inside a function in map. Unnesting the resulting column gives you a long-shaped data frame of adjustments.
Like I said, I don't have the expertise to know whether those last two columns having NAs is expected or not.
Edit: Based on @wibeasley's question about retaining the quarter column, I'm adding a mutate that adds a column of the quarters listed in the nested data frame.
library(seasonal)
library(tidyverse)
set.seed(1234)
df <- data.frame(region = rep(1:10, each = 20),
                 quarter = rep(1:20, 10),
                 var = sample(5:200, 200, replace = T))
df %>%
  group_by(region) %>%
  nest() %>%
  mutate(data.ts = map(data, function(x) {
    tem2 <- ts(x$var, frequency = 4, start = c(1990, 1))
    tem3 <- seas(tem2)
    as.data.frame(tem3$data) %>%
      mutate(quarter = x$quarter)
  })) %>%
  unnest(data.ts)
#> # A tibble: 200 x 8
#> region final seasonaladj trend irregular quarter seasonal adjustfac
#> <int> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
#> 1 1 27 27 97.0 -68.0 1 NA NA
#> 2 1 126 126 97.0 27.9 2 NA NA
#> 3 1 124 124 97.0 27.1 3 NA NA
#> 4 1 127 127 97.0 30.6 4 NA NA
#> 5 1 173 173 97.0 75.0 5 NA NA
#> 6 1 130 130 97.0 32.1 6 NA NA
#> 7 1 6 6 97.0 -89.0 7 NA NA
#> 8 1 50 50 97.0 -46.5 8 NA NA
#> 9 1 135 135 97.0 36.7 9 NA NA
#> 10 1 105 105 97.0 8.81 10 NA NA
#> # ... with 190 more rows
I also gave a bit more thought to doing this without nesting, and instead tried doing it with a split. Passing that list of data frames into imap_dfr lets me take each split piece of the data frame and its name (in this case, the value of region), then return everything rbinded back together into one data frame. I sometimes shy away from nested data just because I have trouble seeing what's going on, so this is an alternative that is maybe more transparent.
df %>%
  split(.$region) %>%
  imap_dfr(function(x, reg) {
    tem2 <- ts(x$var, frequency = 4, start = c(1990, 1))
    tem3 <- seas(tem2)
    as.data.frame(tem3$data) %>%
      mutate(region = reg, quarter = x$quarter)
  }) %>%
  select(region, quarter, everything()) %>%
  head()
#> region quarter final seasonaladj trend irregular seasonal adjustfac
#> 1 1 1 27 27 96.95 -67.97274 NA NA
#> 2 1 2 126 126 96.95 27.87378 NA NA
#> 3 1 3 124 124 96.95 27.10823 NA NA
#> 4 1 4 127 127 96.95 30.55077 NA NA
#> 5 1 5 173 173 96.95 75.01353 NA NA
#> 6 1 6 130 130 96.95 32.10669 NA NA
Created on 2018-08-12 by the reprex package (v0.2.0).
I put all the action inside of f(), and then called it with purrr::map_df(). The re-inclusion of quarter is a hack.
f <- function(.region) {
  d <- df %>%
    dplyr::filter(region == .region)
  y <- d %>%
    dplyr::pull(var) %>%
    ts(frequency = 4, start = c(1990, 1)) %>%
    seas()
  y$data %>%
    as.data.frame() %>%
    # dplyr::select(-seasonal, -adjustfac) %>%
    dplyr::mutate(
      quarter = d$quarter
    )
}
purrr::map_df(1:10, f, .id = "region")
results:
region final seasonaladj trend irregular quarter seasonal adjustfac
1 1 27.00000 27.00000 96.95000 -6.797279e+01 1 NA NA
2 1 126.00000 126.00000 96.95000 2.787381e+01 2 NA NA
3 1 124.00000 124.00000 96.95000 2.710823e+01 3 NA NA
4 1 127.00000 127.00000 96.95000 3.055075e+01 4 NA NA
5 1 173.00000 173.00000 96.95000 7.501355e+01 5 NA NA
6 1 130.00000 130.00000 96.95000 3.210672e+01 6 NA NA
7 1 6.00000 6.00000 96.95000 -8.899356e+01 7 NA NA
8 1 50.00000 50.00000 96.95000 -4.647254e+01 8 NA NA
9 1 135.00000 135.00000 96.95000 3.671077e+01 9 NA NA
10 1 105.00000 105.00000 96.95000 8.806955e+00 10 NA NA
...
96 5 55.01724 55.01724 60.25848 9.130207e-01 16 1.9084928 1.9084928
97 5 60.21549 60.21549 59.43828 1.013076e+00 17 1.0462424 1.0462424
98 5 58.30626 58.30626 58.87065 9.904130e-01 18 0.1715082 0.1715082
99 5 61.68175 61.68175 58.07827 1.062045e+00 19 1.0537962 1.0537962
100 5 59.30138 59.30138 56.70798 1.045733e+00 20 2.5294523 2.5294523
...

Extend data frame column with inflation in R

I'm trying to extend some code to be able to:
1) read in a vector of prices
2) left join that vector of prices to a data frame of years (or years and months)
3) append/fill the prices for missing years with interpolated data based on the last year of available prices plus a specified inflation rate. Consider an example like this one:
prices <- data.frame(year = 2018:2022,
                     wti = c(75, 80, 90, NA, NA),
                     brent = c(80, 85, 94, 93, NA))
What I need is something that will fill the missing rows of each column with the last price plus inflation (suppose 2%). I can do this in a pretty brute-force way as:
i_rate <- 0.02
for (i in c(1:nrow(prices))) {
  if (is.na(prices$wti[i]))
    prices$wti[i] <- prices$wti[i - 1] * (1 + i_rate)
  if (is.na(prices$brent[i]))
    prices$brent[i] <- prices$brent[i - 1] * (1 + i_rate)
}
It seems to me there should be a way to do this using some combination of apply() and/or fill() but I can't seem to make it work.
Any help would be much appreciated.
As noted by @camille, the problem with dplyr::lag is that it doesn't work here with consecutive NAs, because it uses the "original" ith element of a vector instead of the "revised" ith element. We first have to write a version of lag that carries the revised values forward, by creating a new function:
impute_inflation <- function(x, rate) {
  output <- x
  # Create an empty vector to fill in with the loop; preallocating makes R
  # faster for vectors with a large number of elements.
  y <- rep(NA, length = length(x))
  for (i in seq_len(length(output))) {
    if (i == 1) {
      y[i] <- output[i]  # avoid an error from attempting to use the 0th element
    } else {
      y[i] <- output[i - 1]
    }
    if (is.na(output[i])) {
      output[i] <- y[i] * (1 + rate)
    }
  }
  output
}
Then it's a cinch to apply this across a bunch of variables with dplyr::mutate_at():
library(dplyr)
mutate_at(prices, vars(wti, brent), impute_inflation, 0.02)
year wti brent
1 2018 75.000 80.00
2 2019 80.000 85.00
3 2020 90.000 94.00
4 2021 91.800 93.00
5 2022 93.636 94.86
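For what it's worth, mutate_at() is superseded in dplyr 1.0+; the same call can be written with across() (a sketch using the impute_inflation() defined above):
library(dplyr)
prices %>%
  mutate(across(c(wti, brent), ~ impute_inflation(.x, rate = 0.02)))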
You can use dplyr::lag to get the previous value in a given column. Your lagged values look like this:
library(dplyr)
inflation_factor <- 1.02
prices <- data_frame(year = 2018:2022,
                     wti = c(75, 80, 90, NA, NA),
                     brent = c(80, 85, 94, 93, NA)) %>%
  mutate_at(vars(wti, brent), as.numeric)
prices %>%
  mutate(prev_wti = lag(wti))
#> # A tibble: 5 x 4
#> year wti brent prev_wti
#> <int> <dbl> <dbl> <dbl>
#> 1 2018 75 80 NA
#> 2 2019 80 85 75
#> 3 2020 90 94 80
#> 4 2021 NA 93 90
#> 5 2022 NA NA NA
When a value is NA, multiply the lagged value by the inflation factor. As you can see, that doesn't handle consecutive NAs, however.
prices %>%
  mutate(wti = ifelse(is.na(wti), lag(wti) * inflation_factor, wti),
         brent = ifelse(is.na(brent), lag(brent) * inflation_factor, brent))
#> # A tibble: 5 x 3
#> year wti brent
#> <int> <dbl> <dbl>
#> 1 2018 75 80
#> 2 2019 80 85
#> 3 2020 90 94
#> 4 2021 91.8 93
#> 5 2022 NA 94.9
Or to scale this and avoid doing the same multiplication over and over, gather the data into a long format, get lags within each group (wti, brent, or any others you may have), and adjust values as needed. Then you can spread back to the original shape:
prices %>%
  tidyr::gather(key = key, value = value, wti, brent) %>%
  group_by(key) %>%
  mutate(value = ifelse(is.na(value), lag(value) * inflation_factor, value)) %>%
  tidyr::spread(key = key, value = value)
#> # A tibble: 5 x 3
#> year brent wti
#> <int> <dbl> <dbl>
#> 1 2018 80 75
#> 2 2019 85 80
#> 3 2020 94 90
#> 4 2021 93 91.8
#> 5 2022 94.9 NA
Created on 2018-07-12 by the reprex package (v0.2.0).
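One further option (my addition, not from the answers above): purrr::accumulate() handles consecutive NAs naturally, because each step sees the already-revised previous value:
library(dplyr)
library(purrr)

impute <- function(x, rate = 0.02) {
  # each step sees the already-imputed previous value, so runs of NAs
  # are filled correctly (assumes the first value is not NA)
  accumulate(x, ~ if (is.na(.y)) .x * (1 + rate) else .y)
}

prices %>% mutate(across(c(wti, brent), impute))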
