How to convert date to lunar date in R?

I have a fisheries data set (sample data set). I'm going to study the moon's impact on the fish catch. I used the lunar package to find the moon phase of each fishing day.
library(lunar)
data$lunar_phase <- lunar.phase(as.Date(data$fdate))
The output is as follows:
fdate lunar_phase
29/3/2006 3.51789248
28/3/2006 1.255536876
24/3/2006 4.559716361
26/3/2006 2.801242263
25/3/2006 0.538886659
The lunar package can categorize the lunar phase into 4 or 8 periods.
Instead, I need to convert each fishing date to a day within the lunar cycle. The lunar cycle is 29.53 days long. If lunar day 0 is the full moon, I want to find the lunar-cycle day of every other date.
Is there a way to do that?
The expected output may be as follows:
fdate lunar_day
29/3/2006 6
28/3/2006 4
24/3/2006 10
26/3/2006 5
25/3/2006 1

I don't know of a package that calculates "lunar day" from a date. In theory you could do it from your dataset, by identifying the maximum and minimum values for phase, then converting phases to percentages and rounding up as a proportion of 29.53.
However, the lunar package also calculates illumination (as a fraction of visible surface). I think this is a good proxy for lunar day and also gives you a physical value, rather than something more arbitrary.
Using your data, it's clear that new moon occurs near the start of the month:
library(tidyverse)
library(lunar)
sample_data <- read_csv("sample_data.csv")
sample_data %>%
  mutate(Date = as.Date(fdate, "%d/%m/%Y"),
         illum = lunar.illumination.mean(Date)) %>%
  ggplot(aes(Date, illum)) + geom_point()
We could also fill in the missing dates, which makes the lunar cycle apparent:
all_dates <- data.frame(Date = seq.Date(min(as.Date(sample_data$fdate, "%d/%m/%Y")),
                                        max(as.Date(sample_data$fdate, "%d/%m/%Y")),
                                        by = "1 day")) %>%
  mutate(illum = lunar.illumination.mean(Date))
all_dates %>%
  ggplot(aes(Date, illum)) + geom_point()
Now, assuming your dataset has a column named catch, we could begin analysis by joining the catch data with the full date range, then plotting catch and lunar illumination. This dataset may also be used for regression, correlation etc.
# simulated catch data
set.seed(123)
sample_data <- sample_data %>%
  mutate(catch = rnorm(16, 100, 30))
all_dates %>%
  left_join(mutate(sample_data, Date = as.Date(fdate, "%d/%m/%Y"))) %>%
  select(Date, illum, catch) %>%
  gather(variable, value, -Date) %>%
  ggplot(aes(Date, value)) +
  geom_point() +
  facet_grid(variable ~ ., scales = "free_y")
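The phase-based conversion mentioned at the start of this answer could be sketched roughly as follows. This is only a sketch: phase_to_lunar_day is a hypothetical helper, not a function from the lunar package, and it assumes the phase is given in radians with 0 = new moon and pi = full moon (the convention lunar.phase() documents) and a cycle length of 29.53 days.

```r
# Hypothetical helper (not part of the lunar package): convert a lunar phase
# in radians (0 = new moon, pi = full moon) to days elapsed since full moon.
phase_to_lunar_day <- function(phase, cycle_days = 29.53) {
  # Shift so that full moon (pi) maps to 0, then wrap into [0, 2*pi)
  shifted <- (phase - pi) %% (2 * pi)
  # Scale the completed fraction of the cycle to days
  shifted / (2 * pi) * cycle_days
}

# Full moon, last quarter, new moon, first quarter
phases <- c(pi, 3 * pi / 2, 0, pi / 2)
lunar_day <- phase_to_lunar_day(phases)
round(lunar_day, 2)
```

With the lunar package loaded, something like phase_to_lunar_day(lunar.phase(as.Date(fdate, "%d/%m/%Y"))) should give day 0 at full moon, though the result is still an approximation because the real cycle length varies slightly.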

lunar.phase() returns the phase in radians (0 = new moon, π = full moon), so the phase advances by about 0.212769 (that is, 2π / 29.53) per day. Dividing the phase by this increment gives an approximate lunar day:
library(lunar)
lunar_day <- lunar.phase(as.Date("2020-07-21")) /(0.212769)
round(lunar_day, 0)
This works, approximately, up to day 29, after which the count starts again at new moon. A tidyverse version is given below:
library(lunar)
library(tidyverse)
new_data <-
  data %>%
  # parse the day/month/year strings explicitly before computing the phase
  mutate(lunar_day = round(lunar.phase(as.Date(fdate, "%d/%m/%Y")) / 0.212769))
After the 14th day, the cycle runs from full moon back to new moon. A function could be written that represents the cycle as {new, wax-1, wax-2, ..., full, wane-1, wane-2, ..., new}.
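Such a labelling function might look like the sketch below. label_lunar_day is a hypothetical name, and the function assumes integer lunar days with 0 = new moon, full moon on day 15, and a cycle rounded to 30 whole days. All three are simplifying assumptions rather than anything the lunar package provides.

```r
# Hypothetical helper: label an integer lunar day (0 = new moon) as
# "new", "wax-k", "full", or "wane-k". Assumes full moon on day 15 of a
# cycle rounded to 30 whole days, which is only an approximation.
label_lunar_day <- function(day, full_day = 15, cycle = 30) {
  # Wrap days past the end of the cycle back to its start
  day <- day %% cycle
  ifelse(day == 0, "new",
  ifelse(day < full_day, paste0("wax-", day),
  ifelse(day == full_day, "full",
         paste0("wane-", day - full_day))))
}

labels <- label_lunar_day(c(0, 1, 14, 15, 16, 29, 30))
labels
```

Days 1 to 14 are labelled as waxing, days 16 to 29 as waning, and day 30 wraps back to "new".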

Related

Creating a Cumulative Sum Plot using ggplot with duplicate x values

In my hypothetical example, people order ice-cream at a stand and each time an order is placed, the month the order was made and the number of orders placed is recorded. Each row represents a unique person who placed the order. For each flavor of ice-cream, I am curious to know the cumulative orders placed over the various months. For instance if a total of 3 Vanilla orders were placed in April and 4 in May, the graph should show one data point at 3 for April and one at 7 for May.
The issue I am running into is each row is being plotted separately (so there would be 3 separate points at April as opposed to just 1).
My secondary issue is that my dates are not in chronological order on my graph. I thought converting the Month column to Date format would fix this but it doesn't seem to.
Here is my code below:
library(lubridate)
Flavor <- c("Vanilla", "Vanilla","Vanilla","Vanilla","Vanilla","Vanilla","Vanilla","Vanilla","Vanilla","Vanilla","Vanilla","Vanilla","Strawberry","Strawberry","Strawberry","Strawberry","Strawberry","Strawberry","Strawberry","Strawberry","Strawberry","Strawberry","Strawberry","Strawberry","chocolate","chocolate","chocolate")
Month <- c("1-Jun-21", "1-May-19", "1-May-19","1-Apr-19", "1-Apr-19","1-Apr-19","1-Apr-19", "1-Mar-19", "1-Mar-19", "1-Mar-19","1-Mar-19", "1-Apr-19", "1-Mar-19", " 1-Apr-19", " 1-Jan-21", "1-May-19", "1-May-19","1-May-19","1-May-19","1-Jun-19","2-September-19", "1-September-19","1-September-19","1-December-19","1-May-19","1-May-19","1-Jun-19")
Orders <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2)
data <- data.frame(Flavor,Month,Orders)
data$Month <- dmy(data$Month)
str(data)
data2 <- data[data$Flavor == "Vanilla",]
ggplot(data=data2, aes(x=Month, y=cumsum(Orders))) + geom_point()
In these situations, it's usually best to pre-compute your desired summary and send that to ggplot, rather than messing around with ggplot's summary functions. I've also added a geom_line() for clarity.
library(tidyverse)
data %>%
  group_by(Flavor, Month) %>%
  # one row per flavor and month
  summarize(Orders = sum(Orders), .groups = "drop") %>%
  group_by(Flavor) %>%
  arrange(Month) %>%
  mutate(Orders = cumsum(Orders)) %>%
  ggplot(data = ., aes(x=Month, y=Orders, color = Flavor)) + geom_point() + geom_line()
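To check the pre-computed summary against the example in the question (3 Vanilla orders in April, 4 in May), here is a small base R sketch of the same two steps. The toy data frame orders is made up for illustration, and aggregate() and ave() stand in for the dplyr verbs used above.

```r
# Toy data: one row per order; 3 Vanilla orders in April and 4 in May
orders <- data.frame(
  Flavor = rep("Vanilla", 7),
  Month  = as.Date(c(rep("2019-04-01", 3), rep("2019-05-01", 4))),
  Orders = 1
)

# Step 1: collapse to one row per flavor and month
monthly <- aggregate(Orders ~ Flavor + Month, data = orders, FUN = sum)

# Step 2: sort chronologically, then take the running total within each flavor
monthly <- monthly[order(monthly$Flavor, monthly$Month), ]
monthly$Cumulative <- ave(monthly$Orders, monthly$Flavor, FUN = cumsum)

monthly
```

The Cumulative column should then hold one value per month (3 for April, 7 for May), which is what the plot needs.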

R - Draw cases per 100k population

I am trying to draw a line graph of COVID cases for each date. I do not have the expected output; the lecturer gave only the questions. I solved the question, but my problem is the output: it looks weird. Here is the question:
"For the ten countries with the highest absolute number of total deaths, draw the following line graphs to visualize whether epidemic has started to slow down and how the growth rate of new cases/deaths differs across those countries.
a) Number of new cases at each date (absolute number vs per 100.000 population)"
Here is my code:
library(utils)
COVID_data <-read.csv("https://opendata.ecdc.europa.eu/covid19/nationalcasedeath_eueea_daily_ei/csv", na.strings = "", fileEncoding = "UTF-8-BOM")
#Finding ten countries where the highest absolute total deaths number is
abs_total_deaths <-COVID_data %>%
group_by(countriesAndTerritories) %>%
summarise(abs_total_deaths = sum(deaths)) %>%
arrange(desc(abs_total_deaths))
abs_ten_total_deaths <- c('Italy','France','Germany','Spain','Poland',
'Romania','Czechia','Hungary','Belgium','Bulgaria')
#Calculate new cases by dividing absolute number to 100.000 population
#Draw line for each country
COVID_data %>%
filter(countriesAndTerritories %in% abs_ten_total_deaths) %>%
filter(cases >0) %>%
mutate(new_cases = cases/100000) %>%
ungroup() %>%
ggplot()+
geom_line(aes(x = dateRep, y = new_cases, color = countriesAndTerritories),size=1)+
labs(x="Date",
y="New Cases",
title="New Cases per 100.000 population") +
facet_wrap(~countriesAndTerritories)+
theme_bw()
I will also add a picture of my output. I think my graph is not correct, because the output looks really weird. I can't understand where I made a mistake. I would appreciate any help.
Here is the output:
Looking at Belgium, I get total deaths = 25051 from your data file, which tallies exactly with the data here.
It's obvious that the highest value (by far) for every country occurs "on" the earliest date for the country in the file. Amongst your top ten (I agree with your selection), this is 01Mar2021 for every country apart from Spain, and 28Feb2021 for Spain.
These two facts lead me to conclude (1) your graphs correctly display the data you have asked them to summarise and that (2) you have a data artefact: the first record for each country contains the cumulative total to date, whereas subsequent dates contain data reported "in the previous 24 hours". I use quotes because different countries have different reporting conventions. For example, in the UK (since August 2020) "COVID-related deaths" are deaths from any cause within 28 days of a positive COVID test. Citation
Therefore, to get meaningful graphs, I think your only option is to discard the cumulative data contained in the first record for each country. Here's how I would do that:
library(utils)
library(tidyverse)
COVID_data <-read.csv("https://opendata.ecdc.europa.eu/covid19/nationalcasedeath_eueea_daily_ei/csv", na.strings = "", fileEncoding = "UTF-8-BOM")
# For better printing
COVID_data <- as_tibble(COVID_data)
# Which countries have the highest absolute death toll?
# [I get the same countries as you do.]
top10 <- COVID_data %>%
  group_by(countriesAndTerritories) %>%
  summarise(TotalDeaths = sum(deaths)) %>%
  slice_max(TotalDeaths, n = 10) %>%
  distinct(countriesAndTerritories) %>%
  pull(countriesAndTerritories)
COVID_data %>%
  filter(countriesAndTerritories %in% top10) %>%
  mutate(
    deathRate = 100000 * deaths / popData2020,
    caseRate = 100000 * cases / popData2020,
    Date = lubridate::dmy(dateRep)
  ) %>%
  arrange(countriesAndTerritories, Date) %>%
  group_by(countriesAndTerritories) %>%
  filter(row_number() > 1) %>%
  ggplot() +
  geom_line(aes(x = Date, y = deathRate)) +
  facet_wrap(~countriesAndTerritories)
The critical part that excludes the first data row for each country is
arrange(countriesAndTerritories, Date) %>%
group_by(countriesAndTerritories) %>%
filter(row_number() > 1) %>%
The call to arrange is necessary because the data are not in date order to begin with.
This gives the following plot
which is much more like what I (and I suspect, you) would expect.
The sawtooth patterns you see are most likely also reporting artefacts: deaths that take place over the weekend (or on public holidays) are not reported until the following Monday (or next working day). This is certainly true in the UK.
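One further detail in the code above: the rates are computed as 100000 * cases / popData2020, whereas the code in the question uses cases/100000, which only rescales the raw counts. A toy sketch with made-up numbers (not taken from the ECDC file) shows the difference:

```r
# Made-up numbers for illustration only (not taken from the ECDC file)
toy <- data.frame(
  country    = c("A", "B"),
  cases      = c(500, 500),
  population = c(1e6, 1e7)
)

# Dividing by 100000 only rescales the raw counts
toy$scaled_count <- toy$cases / 100000

# Cases per 100,000 population also divides by each country's population
toy$case_rate <- 100000 * toy$cases / toy$population

toy
```

Both toy countries report the same number of cases, so scaled_count is identical for them, while case_rate differs by a factor of ten because the populations differ.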

I need a method/function to efficiently convert weekly data to monthly data in R with fractional weeks

I have time series data of several products in one data frame.
Sample:
columns: Name size volume date
The volumes are measured every week
I wish to convert this to a monthly estimate in a mathematical way, i.e. not simply taking whatever weeks fall in that month. For example, starting at Jan 1st, I take the first four weeks plus 3/7 of the fifth week to make 31 days; for the second month I take the remaining 4/7 of the split week plus whatever else I need, and so on.
This is very tedious to do manually or even in a for loop.
Is there a smarter way of doing this?
Please help
I had an idea, but I am unsure of the implementation: spline the data and simply sum everything at the predicted end of the month. How can I do this?
Split each week into its individual days and assume each day carries an equal share of the week's volume. Then group by month and sum.
library(lubridate)
library(tidyverse)
data <- tibble(product = c(rep("a", 30), rep("b", 30)),
               week = rep(ymd("2020-09-24") + weeks(1:30), 2),
               volume = 1:60)
new <- tibble(date = seq(min(data$week), max(data$week), by = "days")) %>%
  mutate(week = floor_date(date, unit = "weeks", week_start = 4),
         month = floor_date(date, unit = "months")) %>%
  left_join(data) %>%
  group_by(product, week) %>%
  mutate(volume = volume / n()) %>%
  group_by(product, month) %>%
  summarize(volume = sum(volume), .groups = "drop")
Make sure to change week_start to match the day your week starts on.
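To see the arithmetic for a single week that straddles a month boundary, here is a minimal base R sketch. The week and its volume of 70 are made up, and the week is assumed to run from Thursday 2020-10-29 to Wednesday 2020-11-04.

```r
# One weekly measurement of 70 units covering Thu 2020-10-29 to Wed 2020-11-04
week_start  <- as.Date("2020-10-29")
week_volume <- 70

# Spread the weekly volume evenly over its seven days
days         <- seq(week_start, by = "day", length.out = 7)
daily_volume <- rep(week_volume / 7, 7)

# Sum the daily shares by calendar month
monthly <- tapply(daily_volume, format(days, "%Y-%m"), sum)
monthly
```

Three of the seven days fall in October and four in November, so October receives 3/7 of the volume (30) and November 4/7 (40).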

How do I summarize values attributed to several variables in a data set?

First of all I have to describe my data set. It has three columns, where number 1 is country, number 2 is date (%Y-%m-%d), and number 3 is a value associated with each row (average hotel room prices). It continues like that in rows from 1990 to 2019. It works as such:
Country Date Value
France 2011-01-01 700
etc.
I'm trying to turn the date into years instead of the normal %Y-%m-%d format, so it will instead sum the average values for each country each year (instead of each month). How would I go about doing that?
I thought about summarizing the values totally for each country each year, but that is hugely tedious and takes a long time (plus the code will look horrible). So I'm wondering if there is a better solution for this problem that I'm not seeing.
Here is what I have so far. My dataset priceOnly shows the average price for each month. I have also set it up to use only values not equal to 0.
diffyear <- priceOnly %>%
group_by(Country, Date) %>%
summarize(averagePrice = mean(Value[which(Value!=0.0)]))
You can use the lubridate package to extract years and then summarise accordingly.
Something like this:
library(dplyr)
library(lubridate)
diffyear <- priceOnly %>%
  mutate(Year = year(Date)) %>%
  filter(Value > 0) %>%
  group_by(Country, Year) %>%
  summarize(averagePrice = mean(Value, na.rm = TRUE))
And in general, you should always provide a minimal reproducible example with your questions.
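In that spirit, here is a small self-contained base R sketch of the same yearly summary. The data frame prices and its values are made up for illustration, and format() plus aggregate() are used in place of lubridate and dplyr.

```r
# Toy monthly prices (made-up values); zeros stand for missing prices
prices <- data.frame(
  Country = c("France", "France", "France", "Spain", "Spain"),
  Date    = as.Date(c("2011-01-01", "2011-02-01", "2012-01-01",
                      "2011-01-01", "2011-02-01")),
  Value   = c(700, 900, 0, 400, 600)
)

# Extract the calendar year from each date
prices$Year <- as.integer(format(prices$Date, "%Y"))

# Drop zero values, then average per country and year
nonzero  <- prices[prices$Value > 0, ]
diffyear <- aggregate(Value ~ Country + Year, data = nonzero, FUN = mean)
diffyear
```

France in 2012 has only a zero value in this toy data, so it drops out of the result entirely rather than appearing with a mean of 0.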

Summarising weather data by day ( from package nycflights13 in R)

I would like to summarise hourly weather data by day (get the total precipitation and maximum wind speed daily). Found a code snippet on the web, but it results in only one observation for both variables, instead of daily observations.
How can I change this particular code? And what other ways exist to perform this task?
Thanks!
library(nycflights13)
library(dplyr)
precip <- weather %>%
group_by(month, day) %>%
filter(month < 13) %>%
summarise(totprecip = sum(precip), maxwind = max(wind_speed))
