Importing/Plotting a Time Series in R with two columns - r

I have RStudio and want to import a time series data set. The column on the x-axis should be the year, however when I use the ts.plot command it just plots Time on the x-axis. How can I make the years from the data set appear on my plot?
The data set is for Water Usage in NYC from 1898 to 1968. There are two columns, The Year and Water Usage.
This is the link to the data I used (I have downloaded the .TSV file):
https://datamarket.com/data/set/22tl/annual-water-use-in-new-york-city-litres-per-capita-per-day-1898-1968#!ds=22tl&display=line
These are the commands for importing my data:
nyc <- read.csv("~/Desktop/annual-water-use-in-new-york-cit.tsv", sep="")
View(nyc)
ts.plot(nyc)
This is what I get:

There are several ways to do this. I used the CSV file from your link in this demonstration.
library(tidyverse)
nyc <- read_csv("annual-water-use-in-new-york-cit.csv")
head(nyc)
# A tibble: 6 x 2
Year `Annual water use in New York city, litres per capita per day, 1898-1968`
<chr> <chr>
1 1898 402.8
2 1899 421.3
3 1900 431.2
4 1901 426.2
5 1902 425.5
6 1903 423.6
Method 1
Create a time series object and plot this time series.
Firstly, let us fix the column name of the annual water use so that it is easier to call in our code.
nyc <- nyc %>%
  rename(
    water_use = `Annual water use in New York city, litres per capita per day, 1898-1968`
  )
Make the time series object nyc.ts with the ts() function.
nyc.ts <- ts(as.numeric(nyc$water_use), start = 1898)
You can then use the generic plot function to plot the time series.
plot(nyc.ts, xlab = "Years")
Method 2
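Use the autoplot function from the forecast package. Load forecast first, since ggplot2 on its own has no autoplot method for ts objects. Note that this function is built on top of ggplot2.
library(forecast)
autoplot(nyc.ts) + xlab("Years") + ylab("Amount in Litres")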
Method 3
With just ggplot2:
# Anchor each year to 1 January so the result does not depend on today's date
nyc$Year <- as.Date(paste0(nyc$Year, "-01-01"))
nyc$water_use <- as.numeric(nyc$water_use)
ggplot(nyc, aes(x = Year, y = water_use)) + geom_line() + xlab("Years") + ylab("Amount in Litres")
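To check that Method 1 really puts the years on the x-axis, here is a minimal base-R sketch. It uses only the six rows printed by head(nyc) above; the names nyc_small, nyc_small_ts and years are made up for this illustration, and your full data would run to 1968.

```r
# First six rows from the head(nyc) output above (the full file continues to 1968)
nyc_small <- data.frame(
  Year = 1898:1903,
  water_use = c(402.8, 421.3, 431.2, 426.2, 425.5, 423.6)
)

# start = first year, frequency = 1 because there is one observation per year
nyc_small_ts <- ts(nyc_small$water_use, start = min(nyc_small$Year), frequency = 1)

# time() returns the time index of the series; here it should be the years
years <- as.numeric(time(nyc_small_ts))

plot(nyc_small_ts, xlab = "Year", ylab = "Litres per capita per day")
```

Because the ts object carries its own time index, plot() labels the axis with the years without any extra work.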

Related

Plot data over time using 2 variables of start date and end date

I have a dataset that has around 2000 rows.
Each row is a hospital encounter for ICU Admissions. This is data collected over 5 years
The variables of interest are: Encounter Number, Diagnosis Category, Admit Date, Discharge Date
What I want to do is try and plot the ICU occupancy for each day over these 5 years.
Example:
Encounter Number : 786786
Diagnosis Category : Tuberculosis
Admit Date : 2022-01-20
Discharge Date : 2022-01-30
Therefore this patient stayed in the ICU for 10 days starting from 01.20 to 01.30.
There will be other encounters for another diagnosis -
Encounter Number : 786786
Diagnosis Category : Cancer
Admit Date : 2022-01-21
Discharge Date : 2022-01-28
The end goal is to plot the ICU occupancy for EACH date from the EARLIEST Admit Date to the LATEST Discharge Date (x-axis), by Diagnosis Category.
For each date on the x-axis for the 5 year time period, there will be a bar for the diagnosis category.
How can I go about doing this?
Thanks (:
I have encountered this problem myself many times before. The algorithm to count the occupancy is essentially just creating a vector of the days you want to plot, then for each day, counting how many people were admitted before that day and discharged after that day.
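To make the counting rule concrete, here is a minimal base-R sketch using only the two example encounters from the question. The names stays, days and occupancy are made up for this illustration, and I am assuming a patient counts as occupying a bed on both the admit day and the discharge day.

```r
# The two example encounters from the question
stays <- data.frame(
  diagnosis = c("Tuberculosis", "Cancer"),
  admit     = as.Date(c("2022-01-20", "2022-01-21")),
  discharge = as.Date(c("2022-01-30", "2022-01-28"))
)

# Every calendar day from the earliest admission to the latest discharge
days <- seq(min(stays$admit), max(stays$discharge), by = "day")

# For each day, count the stays that started on or before it and ended on or after it
occupancy <- vapply(seq_along(days), function(i) {
  sum(stays$admit <= days[i] & stays$discharge >= days[i])
}, integer(1))
names(occupancy) <- format(days)

occupancy
```

If you would rather not count the discharge day itself, change `>=` to `>` in the comparison.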
We need some realistic data. Given that you have 2000 admissions over the 5 years, and given mean ICU length of stay is typically 3.5 days with a gamma or lognormal type distribution, we can create some reasonable simulated data like this:
# Make data reproducible
set.seed(1)
df <- data.frame(Admit_date = sample(seq(as.POSIXct("2015-01-01"),
                                         as.POSIXct("2020-01-01"), "day"),
                                     2000, TRUE),
                 Diagnosis_category = sample(c("Respiratory",
                                               "Infective",
                                               "Post-op",
                                               "Trauma"), 2000, TRUE),
                 Encounter_number = 56789123 + 1:2000)
# Length of stay in seconds, gamma distributed (mean = shape * scale = 3.5 days)
df$Discharge_date <- df$Admit_date + 86400 * rgamma(2000, shape = 2, scale = 1.75)
df$Discharge_date <- as.Date(df$Discharge_date)
df$Admit_date <- as.Date(df$Admit_date)
df <- df[order(df$Admit_date), c(3, 1, 4, 2)]
rownames(df) <- NULL
head(df)
#> Encounter_number Admit_date Discharge_date Diagnosis_category
#> 1 56790418 2015-01-01 2015-01-02 Post-op
#> 2 56789614 2015-01-05 2015-01-10 Post-op
#> 3 56790100 2015-01-05 2015-01-12 Post-op
#> 4 56790644 2015-01-07 2015-01-07 Trauma
#> 5 56789943 2015-01-08 2015-01-09 Respiratory
#> 6 56790066 2015-01-08 2015-01-13 Trauma
Assuming this is similar to your own data, we can now count the occupancy for each day like this:
library(tidyverse)
# Create vector of all dates you wish to plot
days <- seq(as.Date("2015-01-01"), as.Date("2020-01-01"), "day")
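# reframe() (dplyr >= 1.1.0) allows several rows per group;
# older dplyr versions used summarize() for this
plot_df <- df %>%
  group_by(Diagnosis_category) %>%
  reframe(date = days, count = sapply(days, function(x) {
    sum(Admit_date <= x & Discharge_date >= x)
  }))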
Now we are ready to plot. In my example, we only have 4 diagnostic categories, and trying to plot more than 1,800 daily columns on a single panel is already challenging. If you try to put all your diagnostic categories over 5 years in a single panel, you will get a total mess. This is made worse by the fact that you will only ever have a handful of patients in each diagnostic category (other than during Covid peaks), so the plot will only have a few discrete steps in it. I think it would be best to use facets in this case:
ggplot(plot_df, aes(date, count, fill = Diagnosis_category,
                    color = Diagnosis_category)) +
  geom_col() +
  facet_wrap(.~Diagnosis_category) +
  theme_minimal(base_size = 16) +
  theme(legend.position = "none")
Unless there is a specific point you wish your data to make with this kind of plot (like massive occupancy spikes during Covid surges), you might want to think of a different summary measure. You could try grouping plot_df by diagnostic category and month, then calculating average monthly occupancy.
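As a rough sketch of that monthly summary, here is one way to do it in base R. To keep the example self-contained I build a small stand-in for plot_df with made-up Poisson counts; the column names match plot_df above, but the values and the names month and monthly are assumptions for illustration only.

```r
# Hypothetical stand-in for plot_df: daily counts for two categories over two months
set.seed(1)
days <- seq(as.Date("2015-01-01"), as.Date("2015-02-28"), by = "day")
plot_df <- data.frame(
  Diagnosis_category = rep(c("Trauma", "Post-op"), each = length(days)),
  date  = rep(days, times = 2),
  count = rpois(2 * length(days), lambda = 2)  # made-up daily occupancy
)

# Label each day with its year-month, then average the daily counts within each label
plot_df$month <- format(plot_df$date, "%Y-%m")
monthly <- aggregate(count ~ Diagnosis_category + month, data = plot_df, FUN = mean)

monthly
```

The result has one row per category and month, which is far easier to read as a line or bar chart than 1,800 daily columns.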

How to create a daily time series with monthly cycling patterns

I have a series of daily sales amounts from 1/1/2018 to 10/15/2018; an example is shown below. I have already observed some monthly cyclical patterns in the sales amount: there is always a sales peak at the end of each month, and slight fluctuations in the middle of the month. Also, in general, sales in June, July and August are higher than in other months. Now I need to predict the sales amount for the 10 days after 10/15/2018. I'm new to time series and ARIMA. I have two questions:
1. How to create such a daily time series and plot it with the date?
2. How can I set the cycle(or frequency) to show the monthly cycling pattern?
Date SalesAmount
1/1/2018 31,380.31
1/2/2018 384,418.10
1/3/2018 1,268,633.28
1/4/2018 1,197,742.76
1/5/2018 417,143.36
1/6/2018 693,172.65
1/8/2018 840,384.76
1/9/2018 1,955,909.69
1/10/2018 1,619,242.52
1/11/2018 2,267,017.06
1/12/2018 2,198,519.36
1/13/2018 584,448.06
1/15/2018 1,123,662.63
1/16/2018 2,010,443.35
1/17/2018 958,514.85
1/18/2018 2,190,741.31
1/19/2018 811,623.08
1/20/2018 2,016,031.26
1/21/2018 146,946.29
1/22/2018 1,946,640.57
As there isn't a reproducible example provided in the question, here's one that may help you visualize your data better.
Using the dataset: economics and library ggplot2, you can easily plot a timeseries.
library(ggplot2)
theme_set(theme_minimal())
# Basic line plot
ggplot(data = economics, aes(x = date, y = pop))+
geom_line(color = "#00AFBB", size = 2)
For your question, you just need to pass in x=Date and y=SalesAmount to obtain the plot below. To your 2nd question on predicting sales amount with timeseries, you can check out this question over here: Time series prediction using R
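Before plotting, the Date and SalesAmount columns need to be converted from text. Here is a minimal base-R sketch using the first four rows from the question; I am assuming both columns were read in as character strings, and the names raw and sales are made up for this illustration.

```r
# First four rows from the question, as they would look if read in as text
raw <- data.frame(
  Date = c("1/1/2018", "1/2/2018", "1/3/2018", "1/4/2018"),
  SalesAmount = c("31,380.31", "384,418.10", "1,268,633.28", "1,197,742.76"),
  stringsAsFactors = FALSE
)

sales <- data.frame(
  # Month/day/year strings become proper Date values
  Date = as.Date(raw$Date, format = "%m/%d/%Y"),
  # Strip the thousands separators before converting to numeric
  SalesAmount = as.numeric(gsub(",", "", raw$SalesAmount))
)

plot(sales$Date, sales$SalesAmount, type = "l",
     xlab = "Date", ylab = "Sales amount")
```

With a real Date column, both base plot() and ggplot2 will put calendar dates on the x-axis, and missing days (such as 1/7/2018 in your sample) simply show up as gaps in the spacing.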
The first thing you need before any kind of forecasting is to check whether your data has any seasonality. I recommend adding more data, since it is hard to tell whether a pattern repeats with so few observations. In any case, you can try to detect the seasonality as follows:
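library(readr)
library(dplyr)  # %>%, arrange(), desc()
library(TSA)    # periodogram() (assumed to come from the TSA package)
test <- read_table2("C:/Users/Z003WNWH/Desktop/test.txt",
                    col_types = cols(Date = col_date(format = "%m/%d/%Y"),
                                     SalesAmount = col_number()))
p <- periodogram(test$SalesAmount)
# Rank frequencies by spectral power, strongest first
topF <- data.frame(freq = p$freq, spec = p$spec) %>% arrange(desc(spec))
# Period (in days) of the strongest frequencies
head(1 / topF$freq)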
Once you have more data, you can use ggseasonplot from the forecast package to visualize the different seasons.

Defining X axis by 2 parameters in a scatter plot

I am new to R, and I am working on graphing data that is spread out over the years 1963-2014. In my data, I have one column for the year (year), another for a month (month), and another for the concentration of magnesium in the water (Mg).
I am trying to make a scatter plot of how magnesium concentration has changed over time, but if I plot years on the x-axis and magnesium on the y, I end up with 12 points (one for each month) stacked on top of each other for every year. My data is called water2, and it produces this graph.
Is there a way to ask R to spread these magnesium points out over the months and the years, essentially using two columns to define 1 x-axis? Alternatively, is there a way to create a new column that will define the years and months in one?
# Dummy data
data <- data.frame(year = rep(1963:2014, each = 12),
                   month = rep(1:12, times = 52),
                   value = cumsum(rnorm(12*52)))
# Convert it to a time-series object and plot it:
data.ts <- ts(data$value, start = 1963, frequency = 12)
plot.ts(data.ts, type = "p")
# Or ignore the time variables and make an "index plot" of the values alone:
plot(data$value, type = "p", xaxt = "n")
axis(1, at = seq(1, 12*52, by = 12), labels = 1963:2014)
# If you want to merge year and month into a new variable:
data <- within(data, time <- paste(year, month, sep = "-"))
head(data)
year month value time
1 1963 1 -0.56389506 1963-1
2 1963 2 0.60636512 1963-2
3 1963 3 0.04645893 1963-3
4 1963 4 -0.76187300 1963-4
5 1963 5 -1.22781272 1963-5
6 1963 6 -2.33044086 1963-6
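Note that the pasted time column above is a character string, so it cannot be used directly as a continuous x-axis. A minimal sketch of the alternative, assuming you are happy to place each observation on the first day of its month: build a real Date column from year and month. The data frame water and the Mg values below are made up for illustration.

```r
# Hypothetical stand-in for water2: two years of monthly magnesium readings
water <- data.frame(year = rep(1963:1964, each = 12),
                    month = rep(1:12, times = 2))
set.seed(42)
water$Mg <- cumsum(rnorm(nrow(water)))  # made-up concentrations

# Combine year and month into one Date, pinned to the first of each month
water$date <- as.Date(sprintf("%d-%02d-01", water$year, water$month))

# One point per month, spread out along a proper time axis
plot(water$date, water$Mg, xlab = "Date", ylab = "Mg")
```

The same date column also works as the x aesthetic in ggplot2 if you prefer that over base graphics.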

Subset data and plotting in R

I would like to use R to simplify and subset large datasets (over 100 000 values) and then plot them. Below is a simplified version of my dataset (Figure 1) where I broke it down into three years and two crop types. I have a Year (2011-2013), two crop types (Corn and Soybean) and their total Area.
I want to subset the data into the total Area of Corn and Soybean by year into a new table(example figure 2) with the year, type and total area and then plot the total area by year for each (example of plot in Figure 3).
Figure 1 Small example dataset
Figure 2 New total table
Figure 3 example of graph that I want to produce
I thought I could subset the data by year and crop with
corn2011 <- subset(CropTable, Year==2011 & Lulc=="Corn")
corn2012 <- subset(CropTable, Year==2012 & Lulc=="Corn")
and then I can summarize the data using the sum function
sum(corn2011[,3]),
but I'm not sure how to plot them yearly or against each other to have it look like Figure 3.
For your plot, you could try this:
library(ggplot2)
data.df <- read.table(text="
Year Type Area
1 2011 corn 30
2 2012 corn 15
3 2013 corn 50
4 2011 Soy 45
5 2012 Soy 30
6 2013 Soy 60",
header = TRUE)
ggplot(data=data.df, aes(x=as.factor(Year), y=Area, group=Type, color=Type)) + geom_line() + xlab("Year") + ylab("Area (ha)") + theme_bw() + scale_color_manual(values=c("red", "blue"))
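For the totals table itself (your Figure 2), you do not need one subset per year and crop. Here is a minimal sketch using base R's aggregate(). The column names Year and Lulc come from your subset() calls; the name Area for the third column and all the values below are assumptions for illustration.

```r
# Hypothetical stand-in for CropTable: several fields per year and crop
CropTable <- data.frame(
  Year = c(2011, 2011, 2011, 2012, 2012, 2013),
  Lulc = c("Corn", "Corn", "Soybean", "Corn", "Soybean", "Soybean"),
  Area = c(10, 20, 45, 15, 30, 60)  # made-up areas
)

# Sum Area within every Year/Lulc combination in a single call
totals <- aggregate(Area ~ Year + Lulc, data = CropTable, FUN = sum)

totals
```

The resulting totals data frame has one row per year and crop type, so it can be passed straight to the ggplot call above with Lulc in place of Type.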

trouble getting Date field on X axis using ggplot2

head(bktst.plotdata)
date method product type actuals forecast residual Percent_error month
1 2012-12-31 bauwd CUSTM NET 194727.51 -8192.00 -202919.51 -104.21 Dec12
2 2013-01-31 bauwd CUSTM NET 470416.27 1272.01 -469144.26 -99.73 Jan13
3 2013-02-28 bauwd CUSTM NET 190943.57 -1892.45 -192836.02 -100.99 Feb13
4 2013-03-31 bauwd CUSTM NET -42908.91 2560.05 45468.96 -105.97 Mar13
5 2013-04-30 bauwd CUSTM NET -102401.68 358807.48 461209.16 -450.39 Apr13
6 2013-05-31 bauwd CUSTM NET -134869.73 337325.33 472195.06 -350.11 May13
I have been trying to plot my backtest results using ggplot2. A sample of the dataset is shown above. The dates range from Dec 2012 to Jul 2013, with 3 levels in 'method', 5 levels in 'product' and 2 levels in 'type'.
I tried the code below. The trouble is that R does not label the x-axis the way I expect: I get 'Jan, Feb, Mar, Apr, May, Jun, Jul, Aug', whereas I expect the axis to run from Dec to Jul.
month.plot1 <- ggplot(data=bktst.plotdata, aes(x= date, y=Percent_error, colour=method))
facet4 <- facet_grid(product~type,scales="free_y")
title3 <- ggtitle("Percent Error - Month-over-Month")
xaxis2 <- xlab("Date")
yaxis3 <- ylab("Error (%)")
month.plot1+geom_line(stat="identity", size=1, position="identity")+facet4+title3+xaxis2+yaxis3
# Tried changing the code to this still not getting the X-axis right
month.plot1 <- ggplot(data=bktst.plotdata, aes(x= format(date,'%b%y'), y=Percent_error, colour=method))
month.plot1+geom_line(stat="identity", size=1, position="identity")+facet4+title3+xaxis2+yaxis3
Well, it looks like you are plotting the last day of each month, so it actually makes sense to me that December 31 is plotted very very close to January. If you look at the plotted points (with geom_point) you can see that each point is just to the left of the closest month axis.
It sounds like you want to plot years and months instead of actual dates. There are a variety of ways you might do this, but one thing you could do is change the day part of the date to the first of the month instead of the last. Here I show how you could do this using some functions from the lubridate package along with paste (I have assumed your variable date is already a Date object).
require(lubridate)
bktst.plotdata$date2 = as.Date(with(bktst.plotdata,
                                    paste(year(date), month(date), "01", sep = "-")))
Then the plot axes start at December. You can change the format of the x axis if you load the scales package.
require(scales)
ggplot(data=bktst.plotdata, aes(x = date2, y=Percent_error, colour=method)) +
  facet_grid(product~type,scales="free_y") +
  ggtitle("Percent Error - Month-over-Month") +
  xlab("Date") + ylab("Error (%)") +
  geom_line() +
  scale_x_date(labels=date_format(format = "%m-%Y"))
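The first-of-month conversion can also be done without lubridate. Here is a minimal base-R sketch using the first three dates from the sample data; the vector names date and date2 mirror the columns above but are standalone variables in this illustration.

```r
# First three month-end dates from the sample data
date <- as.Date(c("2012-12-31", "2013-01-31", "2013-02-28"))

# Keep the year and month, replace the day with 01, and convert back to Date
date2 <- as.Date(format(date, "%Y-%m-01"))

date2
```

Either approach gives the same date2 column, so the choice mostly comes down to whether you already have lubridate loaded.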

Resources