R plotting annual data and "January" repeated at end of graph

I'm fairly new to R and am trying to plot some expenditure data. I read the data in from an Excel-exported CSV and then do some manipulation on the dates:
data <- read.csv("Spending2019.csv", header = T)
#converts time so R can use the dates
strdate <- strptime(data$DATE,"%m/%d/%Y")
newdate <- cbind(data,strdate)
finaldata <- newdate[order(strdate),]
This probably isn't the most efficient, but it gets me there :)
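For reference, a more compact sketch of the same steps (same columns, same result) would be:
data <- read.csv("Spending2019.csv", header = TRUE)
data$strdate <- as.POSIXct(data$DATE, format = "%m/%d/%Y")  # parse dates straight to POSIXct
finaldata <- data[order(data$strdate), ]                    # sort rows by date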
Here are the relevant columns of the first four rows of my finaldata data frame:
dput(droplevels(finaldata[1:4,c(5,7)]))
structure(list(AMOUNT = c(25.13, 14.96, 43.22, 18.43), strdate = structure(c(1546578000,
1546750800, 1547010000, 1547010000), class = c("POSIXct", "POSIXt"
), tzone = "")), row.names = c(NA, 4L), class = "data.frame")
The full data set has 146 rows and the dates range from 1/4/2019 to 12/30/2019
I then plot the data
plot(finaldata$strdate,finaldata$AMOUNT, xlab = "Month", ylab = "Amount Spent")
and I get this plot
This is fine for getting started, EXCEPT: why is JAN repeated at the far right end? I have tried various forms of xlim and can't seem to get it to go away.
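The extra label most likely comes from the default axis: plot() picks "pretty" tick locations that extend just past the last data point (30 Dec 2019), so a tick at 1 Jan 2020 gets labelled JAN again. One possible workaround (a sketch using the column names above) is to suppress the default x-axis and draw monthly ticks yourself with axis.POSIXct:
plot(finaldata$strdate, finaldata$AMOUNT, xaxt = "n", xlab = "Month", ylab = "Amount Spent")
ticks <- seq(as.POSIXct("2019-01-01"), as.POSIXct("2019-12-01"), by = "month")  # one tick per month of 2019
axis.POSIXct(1, at = ticks, format = "%b")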

Related

as.POSIXlt vs as.Date and strptime

Basically, I have this data set of per-minute electric power consumption in a household, with 9 columns; the data is:
https://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption
I tried two things and got two somewhat different outputs, and I can't seem to figure out why:
first input:
hpc$Datetime<-as.POSIXlt(hpc$Datetime, format = "%d/%m/%Y %H:%M:%S")
with(hpc,plot(Datetime,Global.active.power, ylab = "Global.active.power(Killowatts)",
xlab = "",type = "l"))
second input:
hpc<-read.table("hpc.txt", skip = 66637, nrow = 2879, sep =";")
hpc$Time<-strptime(hpc$Time, format = "%H:%M:%S")
hpc$Date<-as.Date(hpc$Date, format = "%d/%m/%Y")
with(hpc,plot(Time,Global.active.power, ylab = "Global.active.power(Killowatts)",
xlab = "",type = "l"))
Why is there a line appearing in the second image?
It would be great if someone could kindly help me out!
Thank you in advance.
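A likely cause, for what it's worth: strptime(hpc$Time, format = "%H:%M:%S") contains no date information, so every row is stamped with today's date; the two calendar days of readings are then drawn on top of one another, and the jump between them shows up as the stray line. A sketch of the usual fix (assuming hpc already has the Date, Time and Global.active.power columns used in the question) is to build the timestamp from both columns before plotting:
hpc$Datetime <- as.POSIXct(paste(hpc$Date, hpc$Time), format = "%d/%m/%Y %H:%M:%S")  # combine date and time
with(hpc, plot(Datetime, Global.active.power,
               ylab = "Global.active.power (kilowatts)", xlab = "", type = "l"))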

Filling gaps of time data with zero-values

In my data (https://pastebin.com/CernhBCg) I have irregular timestamps and a corresponding value. In addition to the irregularity, there are large gaps for which I have no values. I know, however, that the value is zero during those gaps, and I would like to fill them with rows where value = 0. How can I do this?
Data
> dput(head(hub2_select,10))
structure(list(time = structure(c(1492033212.648, 1492033212.659,
1492033212.68, 1492033212.691, 1492033212.702, 1492033212.724,
1492033212.735, 1492033212.757, 1492033212.768, 1492033212.779
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), value = c(3,
28, 246, 297, 704, 798, 1439, 1606, 1583, 1572)), .Names = c("time",
"value"), row.names = c(NA, 10L), class = "data.frame")
Please use the file I provided to see the full data; it can be read into R with
library(readr)
df <- read_csv("data.csv", col_types = list(time = col_datetime(), value = col_double()))
Solutions
For one, the values to the left and right of a gap are usually 0 or 1, so that might help. I thought I'd use a rolling join, but from what I understand by now, that doesn't seem to be the way to go.
What works is
library(dplyr)
library(lubridate)
threshold_time <- dseconds(2)
time_prev <- df$time[1]
addrows <- data.frame()
for (i in seq(2, nrow(df), 1)) {
  time_current <- df$time[i]
  if ((time_current - time_prev) > threshold_time) {
    # fill the gap with a 0.1-second grid of zero values
    time_add <- seq(time_prev, time_current, dseconds(0.1))
    addrows <- bind_rows(addrows, data.frame(time = time_add, value = rep(0, length(time_add))))
  }
  time_prev <- time_current
}
addrows$type <- 'filled'
df$type <- 'orig'
df_new <- bind_rows(df, addrows)
library(ggplot2)
ggplot(df_new, aes(time,value,color=type)) + geom_point()
But this solution is neither elegant nor efficient (I did not test efficiency, though).
Honestly I haven't tried it yet (I had to switch to Python for other reasons, solved it there, and didn't get around to trying it out), but I am pretty sure https://cran.r-project.org/web/packages/padr/vignettes/padr.html would have been the answer. I just wanted to write this here for other readers with the same question.
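For readers who do want to try that route, here is a rough, untested sketch of what the padr approach might look like; it assumes padding to a whole-second grid is acceptable and uses the column names from the dput above:
library(readr)
library(dplyr)
library(padr)
df <- read_csv("data.csv", col_types = list(time = col_datetime(), value = col_double()))
df_filled <- df %>%
  thicken("sec", colname = "time_sec") %>%  # add a column rounded to whole seconds
  group_by(time_sec) %>%
  summarise(value = sum(value)) %>%         # one row per observed second (sum is just one choice)
  pad(interval = "sec") %>%                 # insert rows for the missing seconds
  fill_by_value(value, value = 0)           # give the inserted rows value = 0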

R Error: index is not in increasing order

NOTE: PROBLEM RESOLVED IN THE COMMENTS BELOW
I'm getting the following error when trying to turn a data.frame into xts, following the answer found here.
Error in .xts(DA[, 3:6], index = as.POSIXct(DAINDEX, format = "%m/%d/%Y %H:%M:%S", :
index is not in increasing order
I've not been able to find much on this error or how to resolve it, so any help towards that would be greatly appreciated.
The data is daily S&P 500 in a comma delimited format with the following columns: "Date" "Time" "Open" "High" "Low" "Close".
Below is the code:
DA <- read.csv("SNP.csv", header = TRUE, stringsAsFactors = FALSE)
DAINDEX <- paste(DA$Date, DA$Time, sep = " ")
Data.hist <- .xts(DA[,3:6], index = as.POSIXct(DAINDEX, format = "%m/%d/%Y %H:%M:%S", tzone = "GMT"))
As requested, some lines of the data
structure(list(Date = c("5/20/2016", "5/19/2016", "5/18/2016",
"5/17/2016", "5/16/2016", "5/13/2016"), Time = c("0:00:00", "0:00:00",
"0:00:00", "0:00:00", "0:00:00", "0:00:00"), Open = c(2041.880005,
2044.209961, 2044.380005, 2065.040039, 2046.530029, 2062.5),
High = c(2058.350098, 2044.209961, 2060.610107, 2065.689941,
2071.879883, 2066.790039), Low = c(2041.880005, 2025.910034,
2034.48999, 2040.819946, 2046.530029, 2043.130005), Close = c(2052.320068,
2040.040039, 2047.630005, 2047.209961, 2066.659912, 2046.609985
)), .Names = c("Date", "Time", "Open", "High", "Low", "Close"
), row.names = c(NA, 6L), class = "data.frame")
The above is the output of dput(head(DA))
The easiest thing to do is use the regular xts constructor instead of .xts. It will check whether the index is sorted correctly, and sort the index and data if necessary.
Data.hist <- xts(DA[, 3:6], as.POSIXct(DAINDEX, format = "%m/%d/%Y %H:%M:%S", tz = "GMT"))
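If you would rather keep the faster .xts constructor, an alternative sketch (same column names as the dput above) is to sort the rows into increasing date order first:
idx <- as.POSIXct(DAINDEX, format = "%m/%d/%Y %H:%M:%S", tz = "GMT")
ord <- order(idx)  # the file is in decreasing date order, so reorder it
Data.hist <- .xts(DA[ord, 3:6], index = idx[ord], tzone = "GMT")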

Convert date to month/year format for time series

I have some water quality sample data.
> dput(GrowingArealog90s[1:10,])
structure(list(SampleDate = structure(c(6948, 6949, 6950, 7516,
7517, 7782, 7783, 7784, 8092, 8106), class = "Date"), Flog90 = c(1.51851393987789,
1.48970743802793, 1.81243963000062, 0.273575501327576, 0.874218895695207,
1.89762709129044, 1.44012088794774, 0.301029995663981, 1.23603370361931,
0.301029995663981)), .Names = c("SampleDate", "Flog90"), class = c("tbl_df",
"data.frame"), row.names = c(NA, -10L))
This data is collected monthly, although some months are missed over the 25 year period.
I know there is a lot of help out there for converting dates to different formats, but I have not been able to figure this out. I want to create a time series with just a month/year format, so that I can do things like decompose the data by month and run seasonal Kendall tests and such. I have tried so many different ways of converting my dates to the desired format that I have completely confused myself. I don't care about the exact format as long as it is recognized as month/year.
I also need to fill in the missing months with NAs.
I tried uploading the "SampleDate" column in a numeric format, "yyyymm". I could then merge that data frame with another that contained all the dates I need.
GA90 <- merge(Dates, GrowingArealog90s, by.x = "Date", by.y = "Date", all.x = TRUE)
However, when I converted the resulting data frame to a time series it would not recognize the 12 month frequency.
GA90ts <- as.ts(GA90, frequency(12))
> GA90ts
Time Series:
Start = 1
End = 324
Frequency = 1
Any help with this is appreciated.
Here's how to do it with zoo. You'll get a warning, but it's OK for now. You'll get a series with mon/yy.
series <-structure(list(SampleDate = structure(c(6948, 6949, 6950, 7516,
7517, 7782, 7783, 7784, 8092, 8106), class = "Date"), Flog90 = c(1.51851393987789,
1.48970743802793, 1.81243963000062, 0.273575501327576, 0.874218895695207,
1.89762709129044, 1.44012088794774, 0.301029995663981, 1.23603370361931,
0.301029995663981)), .Names = c("SampleDate", "Flog90"), class = c("tbl_df",
"data.frame"), row.names = c(NA, -10L))
library(zoo)
series <- as.data.frame(series)  # to drop the dplyr class
series.zoo <- zoo(series[, -1, drop = FALSE], as.yearmon(series[, 1]))
Best practice would be to keep your series with actual dates and use as.yearmon only when you actually need to make calculations or aggregate.zoo by month and year.
The following is a matter of taste, but I've dealt with a lot of time series and I think zoo is superior to ts and xts. Much more flexible.
Now, to fill in missing values, you have to create a vector of dates. Here, I'm using a zoo object with actual dates. I then use na.locf, which is "last observation carried forward". You could also look at na.approx.
library(xts)  # for first() and last()
series.zoo <- zoo(series[, -1, drop = FALSE], series[, 1])
my.seq <- seq.Date(first(series[, 1]), last(series[, 1]), by = "month")
merged <- merge.zoo(series.zoo, zoo(, my.seq))
na.locf(merged)
UPDATE
With aggregate.
GrowingArealog90s <-structure(list(SampleDate = structure(c(6948, 6949, 6950, 7516,
7517, 7782, 7783, 7784, 8092, 8106), class = "Date"), Flog90 = c(1.51851393987789,
1.48970743802793, 1.81243963000062, 0.273575501327576, 0.874218895695207,
1.89762709129044, 1.44012088794774, 0.301029995663981, 1.23603370361931,
0.301029995663981)), .Names = c("SampleDate", "Flog90"), class = c("tbl_df",
"data.frame"), row.names = c(NA, -10L))
library(zoo);library(xts)
GrowingArealog90s <- as.data.frame(GrowingArealog90s)  # to remove the dplyr format
GrowingArealog90s.zoo <- zoo(GrowingArealog90s[, -1, drop = FALSE], as.Date(GrowingArealog90s[, 1]))
# First aggregate by month. I chose to get the mean per month;
# replace mean with last to get the last reading of the month
GrowingArealog90s.agg <- aggregate(GrowingArealog90s.zoo, as.yearmon, mean)
# Then create a sequence of months and merge it
my.seq <- seq.Date(first(GrowingArealog90s[, 1]), last(GrowingArealog90s[, 1]), by = "month")
merged <- merge.zoo(GrowingArealog90s.agg, zoo(, as.yearmon(my.seq)))
na.locf(merged)
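If you then need an actual ts object for decompose() and similar functions, as.ts() on the filled, regular yearmon-indexed series should give a frequency of 12 (this last step is an addition of mine, not part of the original answer):
merged.filled <- na.locf(merged)
GA90ts <- as.ts(merged.filled)
frequency(GA90ts)  # should now report 12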

Creating netCDF files issue

I have created some netCDF files in R before, but right now I am having a problem creating one that I don't know how to handle. I have been looking into the error but I am not sure what causes it. Since my data is too long to post, I include a smaller sample to give an idea of the structure:
#data.frame with the date and the values
dat <-dput(y.or[1:10,])
structure(list(date = structure(c(852073200, 852159600, 852246000,
852332400, 852418800, 852505200, 852591600, 852678000, 852764400,
852850800), class = c("POSIXct", "POSIXt"), tzone = ""), dymax = c(79.125,
75.375, 78, 72.375, 76.375, 76.571, 76.125, 82.75, 86.125, 86
)), .Names = c("date", "dymax"), row.names = c("1997-01-01.01",
"1997-01-01.02", "1997-01-01.03", "1997-01-01.04", "1997-01-01.05",
"1997-01-01.06", "1997-01-01.07", "1997-01-01.08", "1997-01-01.09",
"1997-01-01.10"), class = "data.frame")
#****Creating Netcdf files********
#One lat and lon, and 5478 days (14 years)
missval <- -999
dimX <- dim.def.ncdf( "longitude", "degrees_east",10)
dimY <- dim.def.ncdf( "latitude", "degrees_north", 50)
dimT <- dim.def.ncdf("time",as.Date(dates[1]),as.numeric(dates))
#Def.variable
var <- var.def.ncdf(name="max8hO3","ppb",list(dimX,dimY,dimT), missval=missval, longname="max8hO3",prec="double")
#creating the file
fil <- create.ncdf("fileout.nc",var)
Then, before putting the variable into the file, I get:
Error in nc$var[[nc$varid2Rindex[varid]]] :
attempt to select less than one element
I am sure that I am missing something... but I don't know what. Any ideas?
I would really appreciate some help, thanks!
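For reference, the legacy ncdf functions expect a character units string and a numeric vector of values for each dimension, so a hedged, untested sketch of the intended one-point, multi-year file might look like the following; the longitude/latitude values are placeholders, and dates is assumed to be the Date vector used in the question's dimT line:
library(ncdf)
missval <- -999
dimX <- dim.def.ncdf("longitude", "degrees_east", -121.5)  # placeholder single longitude
dimY <- dim.def.ncdf("latitude", "degrees_north", 44.0)    # placeholder single latitude
dimT <- dim.def.ncdf("time", paste("days since", dates[1]),
                     as.numeric(as.Date(dates) - as.Date(dates[1])), unlim = TRUE)
var <- var.def.ncdf(name = "max8hO3", units = "ppb", dim = list(dimX, dimY, dimT),
                    missval = missval, longname = "max8hO3", prec = "double")
fil <- create.ncdf("fileout.nc", var)
put.var.ncdf(fil, var, dat$dymax)  # the values must match the time dimension's length
close.ncdf(fil)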
