Importing Text File as Zoo in R - r

I am trying to import a text file with data that looks like:
Jan 1998 4.36
Feb 1998 4.34
Mar 1998 4.35
Apr 1998 4.37
May 1998 4.45
Jun 1998 4.54
Jul 1998 4.52
Aug 1998 4.68
Sep 1998 4.82
Oct 1998 4.72
Nov 1998 4.80
...
as a zoo in R. I have tried importing it directly as a zoo:
install.packages("zoo")
library("zoo")
FMAGX_prices <- read.csv.zoo("filepath.../FMAGX_prices.csv", format = "%m/%Y")
and importing it as a data frame and then converting it to a zoo. The reason I create the dates vector and re-assign it to the front of the data frame is that, by default, I get a 3-column data frame: one column with the month abbreviation, one with the year, and one with the price:
install.packages("zoo")
library("zoo")
FMAGX_prices <-read.table("filepath.../FMAGX_prices.txt")
dates <- paste(FMAGX_prices$V1, FMAGX_prices$V2, sep = " ")
FMAGX_prices$V3 <- as.numeric(as.character(FMAGX_prices$V3))
FMAGX_prices$dates <- dates
FMAGX_prices <- subset(FMAGX_prices, select= c(dates, V3))
FMAGX_prices <- read.zoo(FMAGX_prices, "%b %Y")
Neither method works. I always get the error below:
Error in read.zoo(FMAGX_prices, format = "%b %Y") :
index has 144 bad entries at data rows: 1 2 3 4 5 6 7 8 9 10 11...
My assumption is that there is something wrong with my date format, but I am not sure what it would be.
I've tried various combinations of arguments in the read statements, added headers, reformatted the data as a CSV, and changed the dates to 01/1998, 02/1998, etc. (with the corresponding arguments), but I always get that same error.
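A likely cause, offered as a sketch rather than a confirmed diagnosis: when only format is given, read.zoo converts the index with as.Date, and as.Date returns NA for a string that names a month and year but no day. That would make every row a "bad entry". One way around it is to build a "yearmon" index yourself. The example below uses a few rows copied from the question in place of the real file, invents the column names month, year and price for illustration, and assumes an English locale so that %b matches "Jan", "Feb", and so on.

```r
library(zoo)

# Sample rows copied from the question; with the real file you would call
# read.table() on its path instead of using text = Lines.
Lines <- "Jan 1998 4.36
Feb 1998 4.34
Mar 1998 4.35
Apr 1998 4.37"

# Column names here are illustrative, not from the original post
prices <- read.table(text = Lines,
                     col.names = c("month", "year", "price"),
                     stringsAsFactors = FALSE)

# as.Date() needs a day of the month, so a month-year string gives NA
as.Date("Jan 1998", "%b %Y")

# yearmon represents month-year values without needing a day
idx <- as.yearmon(paste(prices$month, prices$year), "%b %Y")
FMAGX_zoo <- zoo(prices$price, order.by = idx)
FMAGX_zoo
```

If a Date index is needed later, as.Date(index(FMAGX_zoo)) converts each month to its first day.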

Related

Formatting date column with different formats (including missing day information) - lubridate

I'm relatively new to R. I downloaded a dataset of clinical trial data and noticed that the dates in the relevant column come in mixed formats: most of them look like "September 1, 2012", but some are missing the day (e.g. October 2015).
I want to express them all in the same way (e.g. yyyy-mm-dd) so that I can work with them. That went fine; the only remaining problem is the name of the output column. In the last function (date_correction) I planned to include an argument "output_col" to which I can pass the intended name for the created (formatted) column, but the column is always literally named output_col.
Do you know, how I could handle this? To pass the intended name of the output column right into the function?
Is there a better way to solve my problem?
-> I even tried a more complex orders argument for lubridate::parse_date_time, like
parse_date_time(input_col, orders="mdy", "my")
but this didn't work.
Here's the code:
library("tidyverse")
library("lubridate")
Observation <- c(seq(1:5))
Date_original <- c("October 2014","August 2014","June 2013",
"June 24, 2010","January 2005")
df_dates <- data.frame(Observation, Date_original)
# looking for a comma in the cell
comma_detect <- function(a_string){
str_detect(a_string, ",")
}
# if comma: assume "mdy", if not apply "my" -> return formatted value
date_correction_row <- function(input_col){
if_else(comma_detect(input_col),
parse_date_time(input_col, orders="mdy"),
parse_date_time(input_col, orders="my"))
}
# prepare function for dataframe:
date_correction <- function(df, input_col, output_col){
mutate(df, output_col = date_correction_row(input_col))
}
df_dates %>% date_correction(df_dates$Date_original, date_formatted) %>% view()
OUTPUT
Observation Date_original output_col
1 1 October 2014 2014-10-01
2 2 August 2014 2014-08-01
3 3 June 2013 2013-06-01
4 4 June 24, 2010 2010-06-24
5 5 January 2005 2005-01-01
In the code below we assume that output_col equals "Date". All three solutions set the column name, give no warnings and use Date class.
1) Try each format and take the one that does not give NA. This uses only base R.
output_col <- "Date"
within(df_dates, assign(output_col, pmin(na.rm = TRUE,
as.Date(Date_original, "%B %d, %Y"),
as.Date(paste(Date_original, 1), "%B %Y %d"))))
## Observation Date_original Date
## 1 1 October 2014 2014-10-01
## 2 2 August 2014 2014-08-01
## 3 3 June 2013 2013-06-01
## 4 4 June 24, 2010 2010-06-24
## 5 5 January 2005 2005-01-01
2) This can also be done in lubridate. It is important that my is the first rather than the second argument to coalesce: my outputs NA for values that do not match its format, whereas mdy gives a wrong date, so if mdy were first, coalesce would never get to my. This approach is shorter than (3), but you might prefer the robustness of (3) since it does not depend on what is returned for non-matching dates.
library(dplyr)
library(lubridate)
output_col <- "Date"
df_dates %>%
mutate(!!output_col := coalesce(my(Date_original, quiet = TRUE),
mdy(Date_original)))
## Observation Date_original Date
## 1 1 October 2014 2014-10-01
## 2 2 August 2014 2014-08-01
## 3 3 June 2013 2013-06-01
## 4 4 June 24, 2010 2010-06-24
## 5 5 January 2005 2005-01-01
3) If you prefer your own method of first checking for a comma, here is a more compact variation of it. It uses my and mdy instead of parse_date_time since my and mdy give Date class results, which are more appropriate here than the POSIXct of parse_date_time given that there are no times.
library(dplyr)
library(lubridate)
output_col <- "Date"
df_dates %>%
mutate(!!output_col := if_else(grepl(",", Date_original),
mdy(Date_original), my(Date_original, quiet = TRUE)))
## Observation Date_original Date
## 1 1 October 2014 2014-10-01
## 2 2 August 2014 2014-08-01
## 3 3 June 2013 2013-06-01
## 4 4 June 24, 2010 2010-06-24
## 5 5 January 2005 2005-01-01
When the date structure is known, I like to explicitly correct the date structure first, then parse. Here I use regex to sub in 1 when the day is missing, then we just parse like normal.
library(tidyverse)
df_dates %>%
mutate(
output_col = gsub("(?<!,)\\s(?=\\d{4})", " 1, ", Date_original, perl = TRUE) %>%
as.Date(., format = '%B %d, %Y')
)
Observation Date_original output_col
1 1 October 2014 2014-10-01
2 2 August 2014 2014-08-01
3 3 June 2013 2013-06-01
4 4 June 24, 2010 2010-06-24
5 5 January 2005 2005-01-01
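To address the original question of passing the column name into the function, here is a minimal base R sketch. It is not the asker's dplyr function: it takes both column names as character strings and assigns with df[[output_col]], which avoids the non-standard evaluation that made mutate name the column output_col. The parsing reuses the two as.Date formats from solution (1) and assumes an English locale for %B.

```r
# Data from the question
df_dates <- data.frame(
  Observation = 1:5,
  Date_original = c("October 2014", "August 2014", "June 2013",
                    "June 24, 2010", "January 2005"),
  stringsAsFactors = FALSE
)

# input_col and output_col are both passed as character strings
date_correction <- function(df, input_col, output_col) {
  x <- df[[input_col]]
  # Full dates such as "June 24, 2010"; month-year strings become NA here
  parsed <- as.Date(x, "%B %d, %Y")
  # Month-year strings such as "October 2014": append day 1 before parsing
  month_only <- as.Date(paste(x, 1), "%B %Y %d")
  missing_day <- is.na(parsed)
  parsed[missing_day] <- month_only[missing_day]
  df[[output_col]] <- parsed
  df
}

result <- date_correction(df_dates, "Date_original", "date_formatted")
result
```

Within dplyr, the same effect comes from the !!output_col := form shown in solutions (2) and (3), placed inside the function body.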

Reading a date, time text file and converting to string using strptime()?

I have a text file of many rows containing date and time and the end goal is for me to group together the number of rows per week that their date values are in. This is so that I can plot a scatter diagram with x values being the week number and y values being the frequency. For example the text file (dates.txt):
Mon May 11 22:51:27 2013
Mon May 11 22:58:34 2013
Wed May 13 23:15:27 2013
Thu May 14 04:11:22 2013
Sat May 16 19:46:55 2013
Sat May 16 22:29:54 2013
Sun May 17 02:08:45 2013
Sun May 17 23:55:15 2013
Mon May 18 00:42:07 2013
So from here, week 1 will have a frequency of 6 and week 2 will have a frequency of 1
As I want to plot a scatter diagram for this, I want to convert them to text values first using strptime() with format %a %b.
My attempt so far has been:
time_stamp <- strptime(time_stamp, format='%a.%b')
However it shows the input string is too long. I'm very new to R-studio so could somebody please help me figure this out?
Thank you
Example of final output graph : https://imgur.com/a/3o3DivA
You could use readLines() to avoid the data frame, then read time using strptime, and finally strftime to format the output.
strftime(strptime(readLines('dates.txt'), '%c'), '%a.%b')
# [1] "Sat.May" "Sat.May" "Mon.May" "Tue.May" "Thu.May" "Thu.May" "Fri.May" "Fri.May" "Sat.May"
Edit
So it appears that your dates have a time zone abbreviation "Mon Apr 06 23:49:29 PDT 2009". Since it is constant during the dates we can specify it literally in the pattern.
We will use '%d_%m' for strftime to get something numeric separated by _, which we feed to strsplit and then type.convert into numerics.
Finally we unlist, create a matrix that we fill byrow, and plot the guy.
strptime(readLines('timestamp.txt'), '%a %b %d %H:%M:%S PDT %Y') |>
strftime('%d_%m') |>
strsplit('_') |>
type.convert(as.is=TRUE) |>
unlist() |>
matrix(ncol=2, byrow=TRUE) |>
plot(pch=20, col=4, main='My Plot', xlab='day', ylab='month')
Note: Please use R>=4.1 for the |> pipes.
You need to first read (or assign) the data, parse it to a date type and then use that to e.g. get the number of the week.
Here is one example
text <- "Mon May 11 22:51:27 2013
Mon May 11 22:58:34 2013
Wed May 13 23:15:27 2013
Thu May 14 04:11:22 2013
Sat May 16 19:46:55 2013
Sat May 16 22:29:54 2013
Sun May 17 02:08:45 2013
Sun May 17 23:55:15 2013
Mon May 18 00:42:07 2013"
data <- read.table(text=text, sep='\n', col.names="dates")
data$parse <- anytime::anytime(data$dates)
data$week <- as.integer(format(data$parse, "%V"))
data
The result is a new data.frame object:
> data
dates parse week
1 Mon May 11 22:51:27 2013 2013-05-11 22:51:27 19
2 Mon May 11 22:58:34 2013 2013-05-11 22:58:34 19
3 Wed May 13 23:15:27 2013 2013-05-13 23:15:27 20
4 Thu May 14 04:11:22 2013 2013-05-14 04:11:22 20
5 Sat May 16 19:46:55 2013 2013-05-16 19:46:55 20
6 Sat May 16 22:29:54 2013 2013-05-16 22:29:54 20
7 Sun May 17 02:08:45 2013 2013-05-17 02:08:45 20
8 Sun May 17 23:55:15 2013 2013-05-17 23:55:15 20
9 Mon May 18 00:42:07 2013 2013-05-18 00:42:07 20
>
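For the stated goal of counting rows per week, a base R sketch along the same lines is shown below. The format string passed for parsing has to describe the whole input string, which is likely why '%a.%b' alone was rejected as too short for the input. This assumes an English locale for %a and %b, sets tz = "UTC" arbitrarily, and uses ISO 8601 week numbers (%V) as in the answer above, so the weeks come out as 19 and 20 rather than 1 and 2.

```r
stamps <- c("Mon May 11 22:51:27 2013",
            "Mon May 11 22:58:34 2013",
            "Wed May 13 23:15:27 2013",
            "Thu May 14 04:11:22 2013",
            "Sat May 16 19:46:55 2013",
            "Sat May 16 22:29:54 2013",
            "Sun May 17 02:08:45 2013",
            "Sun May 17 23:55:15 2013",
            "Mon May 18 00:42:07 2013")

# The format covers every field of the input, not just weekday and month
parsed <- as.POSIXct(stamps, format = "%a %b %d %H:%M:%S %Y", tz = "UTC")

# ISO 8601 week number for each timestamp, then rows counted per week
week <- format(parsed, "%V")
weekly_counts <- table(week)
weekly_counts
```

The counts can then be plotted with plot(as.integer(names(weekly_counts)), as.vector(weekly_counts), xlab = "week", ylab = "frequency").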

Average of month's data (jan-dec) in xts objects

I have this large xts, aggregated monthly with apply.monthly function.
2011-07-31 269.8
2011-08-31 251.0
2011-09-30 201.8
2011-10-31 95.8
2011-11-30 NA
2011-12-31 49.3
2012-01-31 77.1
...
What I want is to calculate the average of Jan-Dec months for all the period. Something like this, but in xts form:
01 541.8
02 23.0
03 34.8
04 12.8
05 21.8
06 44.8
07 22.8
08 55.0
09 287.8
10 15.8
11 113
12 419.3
I want to avoid using dplyr functions like group_by. I think there must be a solution using split and lapply / do.call
I tried splitting the xts by years
xtsobject <- split(xtsobject, f = "years")
but then I don't know how to properly use the lapply function to calculate the 12 averages (Jan-Dec) over the whole period.
This question
Group by period.apply() in xts
is similar, but in my xts I don't have or want a new column. I think it can be done using the xts index.
Assuming the input data x, shown reproducibly in the Note at the end, use aggregate.zoo like this:
ag <- aggregate(x, cycle(as.yearmon(time(x))), mean, na.rm = TRUE)
ag
giving the following zoo series:
1 77.1
7 269.8
8 251.0
9 201.8
10 95.8
11 NaN
12 49.3
We could plot it like this:
plot(ag, type = "h")
Note
Lines <- "2011-07-31 269.8
2011-08-31 251.0
2011-09-30 201.8
2011-10-31 95.8
2011-11-30 NA
2011-12-31 49.3
2012-01-31 77.1"
library(xts)
z <- read.zoo(text = Lines)
x <- as.xts(z)
You can use the base::months function to extract the month before calculating the mean:
do.call(rbind, lapply(split(x, base::months(index(x))), mean, na.rm=TRUE))
output:
[,1]
April 165.1600
August 290.2444
December 106.8200
February 82.6300
January 62.9100
July 264.9889
June 246.4889
March 100.5500
May 246.3333
November 116.6400
October 151.3667
September 158.5667
It seems the index is a number and not a POSIXct object. You can convert it and use format to extract months and use it in tapply :
tapply(xtsobject[, 1], format(as.POSIXct(zoo::index(xtsobject),
origin = '1970-01-01'), '%m'), mean, na.rm = TRUE)
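As a small worked illustration of the tapply approach, the sketch below applies it to the seven sample rows from the Note above, held as plain vectors rather than an xts object (an assumption made only to keep the example free of package dependencies). Grouping by format(dates, "%m") gives zero-padded month labels in calendar order, close to the layout asked for in the question.

```r
# Sample data from the Note, as plain vectors instead of an xts object
dates <- as.Date(c("2011-07-31", "2011-08-31", "2011-09-30", "2011-10-31",
                   "2011-11-30", "2011-12-31", "2012-01-31"))
values <- c(269.8, 251.0, 201.8, 95.8, NA, 49.3, 77.1)

# Group by two-digit month and average across all years, ignoring NAs.
# A month whose values are all NA comes out as NaN.
month_means <- tapply(values, format(dates, "%m"), mean, na.rm = TRUE)
month_means
```

With an xts object, values and dates would be replaced by coredata(x)[, 1] and index(x).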

format a time series as dataframe with julian date

I have a time series tt.txt of daily data from 1st May 1998 to 31 October 2012 in one column as this:
v1
296.172
303.24
303.891
304.603
304.207
303.22
303.137
303.343
304.203
305.029
305.099
304.681
304.32
304.471
305.022
304.938
304.298
304.120
Each number in the text file represents the maximum temperature in kelvin for the corresponding day. I want to put the data in 3 columns as follows by adding year, jday, and the value of the data:
year jday MAX_TEMP
1 1959 325 11.7
2 1959 326 15.6
3 1959 327 14.4
If you have a vector with dates, we can convert it to 'year' and 'jday' by
v1 <- c('May 1998 05', 'October 2012 10')
v2 <- format(as.Date(v1, '%b %Y %d'), '%Y %j')
df1 <- read.table(text=v2, header=FALSE, col.names=c('year', 'jday'))
df1
# year jday
#1 1998 125
#2 2012 284
To convert back from '%Y %j' to 'Date' class
df1$date <- as.Date(do.call(paste, df1[1:2]), '%Y %j')
Update
We can read the dataset with read.table. Create a sequence of dates using seq if we know the start and end dates, cbind with the original dataset after changing the format of 'date' to 'year' and 'julian day'.
dat <- read.table('tt.txt', header=TRUE)
date <- seq(as.Date('1998-05-01'), as.Date('2012-10-31'), by='day')
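# dat[[1]] extracts the column as a vector so the MAX_TEMP name is kept;
# with dat[1] (a one-column data frame) the original name v1 would be used
dat2 <- cbind(read.table(text=format(date, '%Y %j'),
col.names=c('year', 'jday')), MAX_TEMP=dat[[1]])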
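The same idea can be sketched on the first three values from the question. This is a variation, not the code above: it uses length.out so the date sequence always matches the number of readings, and builds the columns directly with format and as.integer rather than going through read.table.

```r
# First three readings from the question (Kelvin)
temps <- c(296.172, 303.24, 303.891)

# One date per reading, starting on the known first day of the series
dates <- seq(as.Date("1998-05-01"), by = "day", length.out = length(temps))

# %Y gives the year, %j the day of the year (1 to 366)
result <- data.frame(year = as.integer(format(dates, "%Y")),
                     jday = as.integer(format(dates, "%j")),
                     MAX_TEMP = temps)
result
```

If the target column should be in degrees Celsius rather than Kelvin, subtract 273.15 from temps before building the data frame.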
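You can use yday. Note that the yday component of a POSIXlt object is zero-based (0 to 365), so add 1 to get the same day of the year that '%j' gives.
as.POSIXlt("8 Jun 15", format = "%d %b %y")$yday + 1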

compute mean of last 5 days of each month in R

I am finding this to be quite tricky. I have an R time series data frame, consisting of a value for each day for about 50 years of data. I would like to compute the mean of only the last 5 values for each month. This would be simple if each month ended in the same 31st day, in which case I could just subset. However, as we all know some months end in 31, some in 30, and then we have leap years. So, is there a simple way to do this in R without having to write a complex indexing function to take account of all the possibilities including leap years? Perhaps a function that works on zoo type objects? The data frame is as follows:
Date val
1 2014-01-06 1.49
2 2014-01-03 1.38
3 2014-01-02 1.34
4 2013-12-31 1.26
5 2013-12-30 2.11
6 2013-12-26 3.20
7 2013-12-25 3.00
8 2013-12-24 2.89
9 2013-12-23 2.90
10 2013-12-22 4.5
tapply. Try this, where dd is your data frame and we have assumed that the Date column is of class "Date". The rows are first put in ascending order of Date so that tail picks the last 5 days of each month rather than the 5 largest values. (If dd is already sorted in descending order of Date, as it appears it might be in the question, then we can skip the ordering and use function(x) mean(head(x, 5)) instead.)
> o <- order(dd$Date)
> tapply(dd$val[o], format(dd$Date[o], "%Y-%m"), function(x) mean(tail(x, 5)))
2013-12 2014-01
2.492000 1.403333
aggregate.zoo. In terms of zoo, we can do the following, which returns another zoo object whose index is of class "yearmon". (In the case of zoo it does not matter whether dd is sorted or not since zoo will sort it automatically.)
> library(zoo)
> z <- read.zoo(dd)
> aggregate(z, as.yearmon, function(x) mean(tail(x, 5)))
Dec 2013 Jan 2014
2.492000 1.403333
REVISIONS. Made some corrections.
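For a self-contained check, the sketch below rebuilds the ten sample rows from the question as dd and applies the tapply approach in base R. The rows are ordered by Date first, so tail(x, 5) takes the five most recent observations in each month and not the five largest values.

```r
# Sample rows from the question
dd <- data.frame(
  Date = as.Date(c("2014-01-06", "2014-01-03", "2014-01-02", "2013-12-31",
                   "2013-12-30", "2013-12-26", "2013-12-25", "2013-12-24",
                   "2013-12-23", "2013-12-22")),
  val = c(1.49, 1.38, 1.34, 1.26, 2.11, 3.20, 3.00, 2.89, 2.90, 4.5)
)

# Ascending date order, so the last rows of each group are the latest days
dd <- dd[order(dd$Date), ]

# Mean of the last (up to) 5 observations within each year-month
last5 <- tapply(dd$val, format(dd$Date, "%Y-%m"),
                function(x) mean(tail(x, 5)))
last5
```

Months with fewer than five observations, such as 2014-01 here, are averaged over whatever rows they have.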

Resources