Trying to convert month number to month name in a date set - r

I'm getting NA values when I try to replace month numbers with month names using the code below:
total_trips_v2$month <- ordered(total_trips_v2$month, levels=c("Jul","Aug","Sep","Oct", "Nov","Dec","Jan", "Feb", "Mar","Apr","May","Jun"))
I'm working with a big data set where the month column is of character type and the months are numbered '06', '07' and so on, starting with 06.
I'm not quite sure what the ordered function I used in the code really does. I saw it somewhere and used it. I tried to look up code to replace specific values in rows, but it looked very confusing.
Can anyone help me out with this?

Working with data types can be confusing at times, but it helps you with what you want to achieve. Thus, make sure you understand how to move from type to type!
There are some "helpers" built into R to work with months and months' names.
Below we have a "character" vector in our data frame, i.e. df$month.
The helper vectors in R are month.name (full month names) and month.abb (abbreviated month names).
You can index a vector by calling the element of the vector at the n-th position.
Thus, month.abb[6] will return "Jun".
We use this to coerce the month to "numeric" and then recode it with the abbreviated names.
# simulating some data
df <- data.frame(month = c("06","06","07","09","01","02"))
# test index month name
month.abb[6]
# check what happens to our column vector - for this we coerce the 06,07, etc. to numbers!
month.abb[as.numeric(df$month)]
# now assign the result
df$month_abb <- month.abb[as.numeric(df$month)]
This yields:
df
month month_abb
1 06 Jun
2 06 Jun
3 07 Jul
4 09 Sep
5 01 Jan
6 02 Feb
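As for why the original ordered() call returned NA: the levels were month names, while the column held codes such as "06", so no value matched any level. The following is a minimal sketch of that effect and of one possible fix using the levels and labels arguments of factor(). The sample vector and the names bad, good and fiscal_order are made up for illustration, and the Jul-to-Jun ordering is simply taken from the question.

```r
# Hypothetical sample of the month codes described in the question
month_chr <- c("06", "07", "12", "01")

# The levels are month names, but the data holds "06", "07", ...;
# nothing matches, so every element becomes NA
bad <- ordered(month_chr,
               levels = c("Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
                          "Jan", "Feb", "Mar", "Apr", "May", "Jun"))

# Match on the codes actually present (levels) and rename them (labels),
# keeping the Jul-to-Jun ordering from the question
fiscal_order <- c(7:12, 1:6)
good <- factor(month_chr,
               levels  = sprintf("%02d", fiscal_order),
               labels  = month.abb[fiscal_order],
               ordered = TRUE)
good
```

With this approach the column becomes an ordered factor, so sorting and plotting follow the Jul-to-Jun order rather than the alphabet.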

The lubridate package can also help you extract certain components of datetime objects, such as month number or name.
Here, I have made some sample dates:
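library(dplyr)      # provides tibble(), mutate() and %>%
library(lubridate)  # provides ymd() and month()
tibble(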
date = c('2021-01-01', '2021-02-01', '2021-03-01')
) %>%
{. ->> my_dates}
my_dates
# # A tibble: 3 x 1
# date
# <chr>
# 2021-01-01
# 2021-02-01
# 2021-03-01
The first thing we need to do is convert these character-formatted values to date-formatted values. We use lubridate::ymd() to do this:
my_dates %>%
mutate(
date = ymd(date)
) %>%
{. ->> my_dates_formatted}
my_dates_formatted
# # A tibble: 3 x 1
# date
# <date>
# 2021-01-01
# 2021-02-01
# 2021-03-01
Note that the format printed under the column name (date) has changed from <chr> to <date>.
Now that the dates are in <date> format, we can pull out different components using lubridate::month(). See ?month for more details.
my_dates_formatted %>%
mutate(
month_num = month(date),
month_name_abb = month(date, label = TRUE),
month_name_full = month(date, label = TRUE, abbr = FALSE)
)
# # A tibble: 3 x 4
# date month_num month_name_abb month_name_full
# <date> <dbl> <ord> <ord>
# 2021-01-01 1 Jan January
# 2021-02-01 2 Feb February
# 2021-03-01 3 Mar March
See my answer to your other question here, but when working with dates in R, it is good to leave them in the default YYYY-MM-DD format. This generally makes calculations and manipulations more straightforward. The month names as shown above can be good for making labels, for example when making figures and labelling data points or axes.
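If loading lubridate is not an option, roughly the same three columns can be sketched in base R. This is only a minimal sketch, assuming the dates are already text in YYYY-MM-DD form. It indexes the built-in month.abb and month.name vectors, so the names come out in English regardless of locale. The variable names are chosen to mirror the columns above.

```r
# Same sample dates as above, parsed with base R
dates <- as.Date(c("2021-01-01", "2021-02-01", "2021-03-01"))

# Month number via format(); "%m" gives a zero-padded string such as "01"
month_num <- as.integer(format(dates, "%m"))

# Ordered factors so that sorting follows the calendar, not the alphabet
month_name_abb  <- factor(month.abb[month_num],  levels = month.abb,  ordered = TRUE)
month_name_full <- factor(month.name[month_num], levels = month.name, ordered = TRUE)

data.frame(date = dates, month_num, month_name_abb, month_name_full)
```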

Related

Easiest way to convert a data.frame to a time series object in R

I need to read a data series stored in a .csv in R and analyze it using the library TSstudio. This data series consists of two columns, the first one stores the date, the second one stores a floating point value measured daily. As straightforward as it could get.
So I first read the csv as a data.frame:
a_data_frame <- read.csv("some_data.csv", sep=";", dec = ",", col.names=c("date", "value"))
head(a_data_frame)
A data.frame: 6 × 2
date value
<chr> <dbl>
1 04/06/1986 0.065041
2 05/06/1986 0.067397
3 06/06/1986 0.066740
4 09/06/1986 0.068247
5 10/06/1986 0.067041
6 11/06/1986 0.066740
The values in the first column are of type char, so I convert them to date thanks to the library lubridate:
library(lubridate)
a_data_frame$date <- dmy(a_data_frame$date)
head(a_data_frame)
A data.frame: 6 × 2
date value
<date> <dbl>
1 1986-06-04 0.065041
2 1986-06-05 0.067397
3 1986-06-06 0.066740
4 1986-06-09 0.068247
5 1986-06-10 0.067041
6 1986-06-11 0.066740
Here comes my headache. When I try to convert the data.frame to time series, I get a matrix of type character instead:
a_time_series <- as.ts(a_data_frame)
head(a_time_series)
A matrix: 6 × 2 of type chr
date value
1986-06-04 0.065041
1986-06-05 0.067397
1986-06-06 0.066740
1986-06-09 0.068247
1986-06-10 0.067041
1986-06-11 0.066740
Is there any other way to convert a data.frame to a ts object?
Assuming some_data.csv is generated reproducibly as in the Note at the end, read it into a zoo series and then use as.ts. That gives a daily series with NAs for the missing days, with time measured as the number of days since the Epoch. That may or may not be the ts object you want, but the question did not specify it further. Also see this answer.
library(zoo)
z <- read.csv.zoo("some_data.csv", format = "%d/%m/%Y")
tt <- as.ts(z); tt
## Time Series:
## Start = 5998
## End = 6005
## Frequency = 1
## [1] 0.065041 0.067397 0.066740 NA NA 0.068247 0.067041 0.066740
Note
Lines <- "date,value
04/06/1986,0.065041
05/06/1986,0.067397
06/06/1986,0.066740
09/06/1986,0.068247
10/06/1986,0.067041
11/06/1986,0.066740"
cat(Lines, file = "some_data.csv")
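For comparison, a similar daily series can be sketched without zoo. This is an illustrative base R alternative, not the method used above: it fills the missing calendar days with NA through merge() and then passes only the value column to ts(). The names all_days, filled and tt_base are made up for this sketch.

```r
# Same six observations as in the Note, with dates parsed as d/m/Y
df <- data.frame(
  date  = as.Date(c("04/06/1986", "05/06/1986", "06/06/1986",
                    "09/06/1986", "10/06/1986", "11/06/1986"),
                  format = "%d/%m/%Y"),
  value = c(0.065041, 0.067397, 0.066740, 0.068247, 0.067041, 0.066740)
)

# Build every calendar day in the range and left-join the observations,
# so that days without a measurement become NA
all_days <- data.frame(date = seq(min(df$date), max(df$date), by = "day"))
filled   <- merge(all_days, df, by = "date", all.x = TRUE)

# Only the numeric column goes into ts(); time is days since 1970-01-01
tt_base <- ts(filled$value, start = as.numeric(min(df$date)), frequency = 1)
tt_base
```

Passing the whole data frame to as.ts() is what produced the character matrix in the question, because the date column was carried along with the values.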

How to use group_by without ordering alphabetically?

I'm trying to visualize some bird data, however after grouping by month, the resulting output is out of order from the original data. It is in order for December, January, February, and March in the original, but after manipulating it results in December, February, January, March.
Any ideas how I can fix this or sort the rows?
This is the code:
BirdDataTimeClean <- BirdDataTimes %>%
group_by(Date) %>%
summarise(Gulls=sum(Gulls), Terns=sum(Terns), Sandpipers=sum(Sandpipers),
Plovers=sum(Plovers), Pelicans=sum(Pelicans), Oystercatchers=sum(Oystercatchers),
Egrets=sum(Egrets), PeregrineFalcon=sum(Peregrine_Falcon), BlackPhoebe=sum(Black_Phoebe),
Raven=sum(Common_Raven))
BirdDataTimeClean2 <- BirdDataTimeClean %>%
pivot_longer(!Date, names_to = "Species", values_to = "Count")
You haven't shared any workable data, but I run into this often when reading from CSV, where all dates and other data come in as character.
As suggested, convert the date column to Date format using the lubridate package or base as.Date(); then arrange() in dplyr, or even group_by(), will order the rows chronologically.
Example with toy data:
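library(data.table)
library(dplyr)      # %>%, group_by(), summarise(), arrange()
library(lubridate)  # ymd()
birds <- data.table(dates = c("2020-Feb-20","2020-Jan-20","2020-Dec-20","2020-Apr-20"),
species = c('Gulls','Terns','Gulls','Sandpiper'),
Counts = c(20,30,40,50))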
str(birds) will show that dates is character (and I have deliberately not kept the rows in order).
Use lubridate to convert the dates:
birds$dates %>% lubridate::ymd() changes them to the Date data type:
birds$dates%>%ymd()%>%str()
Date[1:4], format: "2020-02-20" "2020-01-20" "2020-12-20" "2020-04-20"
Save it with birds$dates <- ymd(birds$dates), or do it in your pipeline as follows.
Now simply do the dplyr analysis:
birds%>%group_by(Months= ymd(dates))%>%
summarise(N=n()
,Species_Count = sum(Counts)
)%>%arrange(Months)
will give
# A tibble: 4 x 3
Months N Species_Count
<date> <int> <dbl>
1 2020-01-20 1 30
2 2020-02-20 1 20
3 2020-04-20 1 50
4 2020-12-20 1 40
However, if you want Apr, Jan, etc. instead of month numbers and reformat the dates (for example with format()), they become "character" again. I would suggest you keep your data as Date and only format it when presenting output to others; if you are using DT or another table package, check its output formatting options. That way your original data stays intact and users see what they want.
This will make it character:
birds%>%group_by(Months= as.character.Date(dates))%>%
summarise(N=n()
,Species_Count = sum(Counts)
)%>%arrange(Months)
A tibble: 4 x 3
Months N Species_Count
<chr> <int> <dbl>
1 2020-Apr-20 1 50
2 2020-Dec-20 1 40
3 2020-Feb-20 1 20
4 2020-Jan-20 1 30
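If the grouping column holds month names rather than full dates, as in the question, another option is to turn it into a factor whose levels are listed in the wanted order. Grouping functions then follow the level order instead of the alphabet. This is a minimal base R sketch with made-up counts; the obs data frame and its values are hypothetical.

```r
# Hypothetical counts, with months in the December-to-March order of the question
obs <- data.frame(
  Month = c("December", "January", "February", "March", "December"),
  Gulls = c(5, 3, 4, 2, 1)
)

# Grouping on a character column: groups come back alphabetically
by_name <- tapply(obs$Gulls, obs$Month, sum)
names(by_name)

# Grouping on a factor with explicit levels: groups follow the level order
obs$Month <- factor(obs$Month,
                    levels = c("December", "January", "February", "March"))
by_level <- tapply(obs$Gulls, obs$Month, sum)
names(by_level)
```

dplyr's group_by() also sorts factor groups by their levels, so the same conversion should carry over to the pipeline in the question.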

R tsibble add support for custom index

Problem description
I work with thrice-monthly data a lot. Thrice monthly (roughly every 10 days, also referred to as a dekad) is the typical reporting interval for water-related data in the former Soviet Union and for many more climate- and water-related data sets around the world. Below is an example data set with 2 variables:
> date = unique(floor_date(seq.Date(as.Date("2019-01-01"), as.Date("2019-12-31"),
by="day"), "10days"))
> example_data <- tibble(
date = date[day(date)!=31],
value = seq(1,36,1),
var = "A") %>%
add_row(tibble(
date = date[day(date)!=31],
value = seq(10,360,10),
var = "B"))
> example_data
# A tibble: 72 x 3
# Groups: var [2]
date value var
<ord> <dbl> <chr>
1 2019-01-01 1 A
2 2019-01-01 10 B
3 2019-01-11 2 A
4 2019-01-11 20 B
5 2019-01-21 3 A
6 2019-01-21 30 B
7 2019-02-01 4 A
8 2019-02-01 40 B
9 2019-02-11 5 A
10 2019-02-11 50 B
# … with 62 more rows
In the example I chose the 1st, 11th, and 21st to date the dekads, but it would actually be more appropriate to index them as dekad 1 to 3 per month (analogous to months 1 to 12 per year) or as dekad 1 to 36 per year (analogous to day of the year). The most elegant solution would be to have a proper date format for dekadal data like yearmonth in lubridate. However, lubridate may not plan to support dekadal data in the near future (github conversation).
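To make the two indexing schemes concrete, here is a minimal base R sketch that derives them from ordinary dates. The helper names dekad_of_month and dekad_of_year are hypothetical, and the sketch assumes the common convention that days 21 to the end of the month all belong to the third dekad.

```r
# Dekad within the month: days 1-10 -> 1, 11-20 -> 2, 21-end -> 3
dekad_of_month <- function(d) {
  mday <- as.POSIXlt(d)$mday
  pmin((mday - 1L) %/% 10L + 1L, 3L)  # cap at 3 so the 31st stays in dekad 3
}

# Dekad within the year: 1 to 36 (POSIXlt months run from 0 to 11)
dekad_of_year <- function(d) {
  as.POSIXlt(d)$mon * 3L + dekad_of_month(d)
}

d <- as.Date(c("2019-01-01", "2019-01-31", "2019-02-11", "2019-12-21"))
dekad_of_month(d)
dekad_of_year(d)
```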
I have workflows using tsibble and timetk which work well with monthly data but it would really be more appropriate to work with the original dekadal time steps and I'm looking for a way to be able to use the tidyverse functions with dekadal data with as few cumbersome workarounds as possible.
The problem with using daily dates for dekadal data in tsibble is that it identifies the time interval as daily and you get a lot of data gaps between your 3 values per month:
> example_data_tsbl <- as_tsibble(example_data, index = date, key = var)
> count_gaps(example_data_tsbl, .full = FALSE)
# A tibble: 70 x 4
var .from .to .n
<chr> <date> <date> <int>
1 A 2019-01-02 2019-01-10 9
2 A 2019-01-12 2019-01-20 9
3 A 2019-01-22 2019-01-31 10
# …
Here's what I did so far:
1. I saw here the possibility to define ordered factors as indices in tsibble but timetk does not recognise factors as indices. timetk suggests to define custom indices (see 2.).
2. There is the possibility to add custom indices to tsibble but I haven't found examples on this and I don't understand how I have to use these functions (a vignette is still planned). I have started reading the code to try to understand how to use the functions to get support for dekadal data but I'm a bit overwhelmed.
Questions
Will dekadal custom indices in tsibble behave similarly to yearmonth or yearweek?
Would anyone here have an example to share on how to add custom indices to tsibble?
Or does anyone know of another way to elegantly handle dekadal data in the tidyverse?
This doesn't discuss tsibbles but it was too long for a comment and does provide an alternative.
zoo can do this either by (1) the code below which does not require the creation of a new class or (2) by creating a new class and methods. For that alternative following the methods that the yearmon class has would be sufficient. See here. zoo itself does not have to be modified.
As we see below, for the first approach dates will be shown as year(cycle) where cycle is 1, 2, ..., 36. Internally the dates are stored as year + (cycle-1)/36 .
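To make the internal representation concrete, here is a small worked example for a single date. It uses the same arithmetic as the conversion functions in this answer, only spelled out step by step; the variable names are chosen for illustration.

```r
# 2019-02-11 is the second dekad of February, i.e. cycle 5 of the year
d <- as.POSIXlt(as.Date("2019-02-11"), tz = "GMT")

# year + (month index + dekad index / 3) / 12, i.e. year + (cycle - 1) / 36
x <- 1900 + d$year + (d$mon + ((d$mday >= 11) + (d$mday >= 21)) / 3) / 12

# Recover the cycle and the year from the numeric value
cycle <- round(36 * (x %% 1)) + 1
year  <- floor(x + .001)
c(x = x, year = year, cycle = cycle)
```

The small .001 offset before floor() guards against floating-point error pushing a value just below the whole year.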
It would also be possible to use the ts class if the dates were consecutive month thirds (or, if they are not, if you don't mind having NAs inserted to make them so). For that use as.ts(z).
Start a fresh session with no packages loaded and then copy and paste the input DF shown in the Note at the end and then this code. Date2dek will convert a Date vector or a character vector representing dates in standard yyyy-mm-dd format to a dek format which is described above. dek2Date performs the inverse transformation. It is not actually used below but might be useful.
library(zoo)
# convert Date or yyyy-mm-dd char vector
Date2dek <- function(x, ...) with(as.POSIXlt(x, tz="GMT"),
1900 + year + (mon + ((mday >= 11) + (mday >= 21)) / 3) / 12)
dek2Date <- function(x, ...) { # not used below but shows inverse
cyc <- round(36 * (as.numeric(x) %% 1)) + 1
if(all(is.na(x))) return(as.Date(x))
month <- (cyc - 1) %/% 3 + 1
day <- 10 * ((cyc - 1) %% 3) + 1
year <- floor(x + .001)
ix <- !is.na(year)
as.Date(paste(year[ix], month[ix], day[ix], sep = "-"))
}
# DF given in Note below
z <- read.zoo(DF, split = "var", FUN = Date2dek, regular = TRUE, freq = 36)
z
The result is the following zooreg object:
A B
2019(1) 1 10
2019(2) 2 20
2019(3) 3 30
2019(4) 4 40
2019(5) 5 50
Note
DF <- data.frame(
date = as.Date(ISOdate(2019, rep(1:2, 3:2), c(1, 11, 21))),
value = c(1:5, 10*(1:5)),
var = rep(c("A", "B"), each = 5))
Extending tsibble to support a new index requires defining methods for these generics:
index_valid() - This method should return TRUE if the class is acceptable as an index
interval_pull() - This method accepts your index values and computes the interval of the data. The interval can be created using tsibble::new_interval(). You may find tsibble::gcd_interval() useful for computing the smallest interval.
seq() and + - These methods are used to produce future time values using the new_data() function.
A minimal example of a new tsibble index class for 'year' is as follows:
library(tsibble)
#>
#> Attaching package: 'tsibble'
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, union
library(vctrs)
# Object creation function
my_year <- function(x = integer()) {
x <- vec_cast(x, integer())
vctrs::new_vctr(x, class = "year")
}
# Declare this class as a valid index
index_valid.year <- function(x) TRUE
# Compute the interval of a year input
interval_pull.year <- function(x) {
tsibble::new_interval(
year = tsibble::gcd_interval(vec_data(x))
)
}
# Specify how sequences are generated from years
seq.year <- function(from, to, by, length.out = NULL, along.with = NULL, ...) {
from <- vec_data(from)
if (!rlang::is_missing(to)) {
vec_assert(to, my_year())
to <- vec_data(to)
}
my_year(NextMethod())
}
# Define `+` operation as needed for `new_data()`
vec_arith.year <- function(op, x, y, ...) {
my_year(vec_arith(op, vec_data(x), vec_data(y), ...))
}
# Use the new index class
x <- tsibble::tsibble(
year = my_year(c(2018, 2020, 2024)),
y = rnorm(3),
index = "year"
)
x
#> # A tsibble: 3 x 2 [2Y]
#> year y
#> <year> <dbl>
#> 1 2018 0.211
#> 2 2020 -0.410
#> 3 2024 0.333
interval(x)
#> <interval[1]>
#> [1] 2Y
new_data(x, 3)
#> # A tsibble: 3 x 1 [2Y]
#> year
#> <year>
#> 1 2026
#> 2 2028
#> 3 2030
Created on 2021-02-08 by the reprex package (v0.3.0)

Can't convert chr to numeric in RStudio

I get NAs when I try to convert into numeric values (see below).
I'm supposed to turn these annual data frames into monthly ones. To do this I need to make the numbers numeric, but I get NAs when I try. Does anyone know why?
When you unlist() the data frame, it turns it into a vector. Here's a couple of lines of the data that I can see from your post (with shorter variable names).
TBS <- tibble::tibble(
desc = c("1934-01", "1934-02"),
rate = c("0.72", "0.6")
)
unlist(TBS)
# desc1 desc2 rate1 rate2
# "1934-01" "1934-02" "0.72" "0.6"
When you do as.numeric() on that vector, it turns the dates into missing. I think that's what the output above in your RStudio window shows us.
as.numeric(unlist(TBS))
# [1] NA NA 0.72 0.60
You're probably better off just fixing the variables in place in the data frame, like this:
library(zoo)
library(lubridate)
library(dplyr)
TBS <- TBS %>%
mutate(desc = as.yearmon(desc),
year = year(desc),
rate = as.numeric(rate))
TBS
# A tibble: 2 x 3
# desc rate year
# <yearmon> <dbl> <dbl>
# 1 Jan 1934 0.72 1934
# 2 Feb 1934 0.6 1934
Then you could do whatever you need (e.g., average) over the years. If it was just a straight average, you could do:
TBS %>%
group_by(year) %>%
summarise(mean_rate = mean(rate))
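The same idea can also be sketched in base R, without zoo, lubridate or dplyr. This is an illustrative alternative under the assumption that desc always has the form "YYYY-MM"; appending "-01" to get a full date is a convention chosen here, not something from the original data.

```r
# Same two rows as above, as a plain data frame
TBS <- data.frame(
  desc = c("1934-01", "1934-02"),
  rate = c("0.72", "0.6"),
  stringsAsFactors = FALSE
)

# A year-month string is not a number, which is where the NAs came from
suppressWarnings(as.numeric("1934-01"))

# Convert each column on its own instead of unlisting the whole data frame
TBS$rate <- as.numeric(TBS$rate)
TBS$date <- as.Date(paste0(TBS$desc, "-01"))  # assumed first day of the month
TBS$year <- as.integer(format(TBS$date, "%Y"))

# Average rate per year
yearly <- aggregate(rate ~ year, data = TBS, FUN = mean)
yearly
```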

Split date data (m/d/y) into 3 separate columns

I need to convert dates (m/d/y format) into 3 separate columns on which I hope to run an algorithm. (I'm trying to convert my dates into Julian Day Numbers.) I saw this suggestion for another user for separating data out into multiple columns using Oracle. I'm using R and am thoroughly stuck on how to code this appropriately. Would A1, A2... represent my new column headings, and what would the format difference be with the "update set" section?
update <tablename> set A1 = substr(ORIG, 1, 4),
A2 = substr(ORIG, 5, 6),
A3 = substr(ORIG, 11, 6),
A4 = substr(ORIG, 17, 5);
I'm trying hard to improve my skills in R but cannot figure this one...any help is much appreciated. Thanks in advance... :)
I use the format() method for Date objects to pull apart dates in R. Using Dirk's datetxt, here is how I would go about breaking up a date into its constituent parts:
datetxt <- c("2010-01-02", "2010-02-03", "2010-09-10")
datetxt <- as.Date(datetxt)
df <- data.frame(date = datetxt,
year = as.numeric(format(datetxt, format = "%Y")),
month = as.numeric(format(datetxt, format = "%m")),
day = as.numeric(format(datetxt, format = "%d")))
Which gives:
> df
date year month day
1 2010-01-02 2010 1 2
2 2010-02-03 2010 2 3
3 2010-09-10 2010 9 10
Note what several others have said; you can get the Julian dates without splitting out the various date components. I added this answer to show how you could do the breaking apart if you needed it for something else.
Given a text variable x, like this:
> x
[1] "10/3/2001"
then:
> as.Date(x,"%m/%d/%Y")
[1] "2001-10-03"
converts it to a date object. Then, if you need it:
> julian(as.Date(x,"%m/%d/%Y"))
[1] 11598
attr(,"origin")
[1] "1970-01-01"
gives you a Julian date (relative to 1970-01-01).
Don't try the substring thing...
See help(as.Date) for more.
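Note that "Julian" is used for several different day counts. The sketch below contrasts three of them for the same date: days since 1970-01-01 as returned by julian(), the day of the year, and the astronomical Julian Day Number. The last one relies on the assumption that 1970-01-01 corresponds to Julian Day Number 2440588, which is worth checking against whatever reference your algorithm uses.

```r
x <- "10/3/2001"
d <- as.Date(x, format = "%m/%d/%Y")

# Days since the default origin, 1970-01-01 (as.numeric drops the attribute)
days_since_epoch <- as.numeric(julian(d))

# Day of the year, 1 to 366
day_of_year <- as.integer(format(d, "%j"))

# Astronomical Julian Day Number, assuming 1970-01-01 is JDN 2440588
jdn <- as.numeric(d) + 2440588

c(days_since_epoch = days_since_epoch, day_of_year = day_of_year, jdn = jdn)
```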
Quick ones:
Julian date converters already exist in base R, see eg help(julian).
One approach may be to parse the date as a POSIXlt and to then read off the components. Other date / time classes and packages will work too but there is something to be said for base R.
Parsing dates as string is almost always a bad approach.
Here is an example:
datetxt <- c("2010-01-02", "2010-02-03", "2010-09-10")
dates <- as.Date(datetxt) ## you could examine these as well
plt <- as.POSIXlt(dates) ## now as POSIXlt types
plt[["year"]] + 1900 ## years are with offset 1900
#[1] 2010 2010 2010
plt[["mon"]] + 1 ## and months are on the 0 .. 11 interval
#[1] 1 2 9
plt[["mday"]]
#[1] 2 3 10
df <- data.frame(year=plt[["year"]] + 1900,
month=plt[["mon"]] + 1, day=plt[["mday"]])
df
# year month day
#1 2010 1 2
#2 2010 2 3
#3 2010 9 10
And of course
julian(dates)
#[1] 14611 14643 14862
#attr(,"origin")
#[1] "1970-01-01"
To convert dates (m/d/y format) into 3 separate columns, consider the df:
df <- data.frame(date = c("01-02-18", "02-20-18", "03-23-18"))
df
date
1 01-02-18
2 02-20-18
3 03-23-18
Convert to date format
df$date <- as.Date(df$date, format="%m-%d-%y")
df
date
1 2018-01-02
2 2018-02-20
3 2018-03-23
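One caveat with the two-digit "%y" format is the century it assumes. As documented in ?strptime, input years 00 to 68 are read as 20xx and 69 to 99 as 19xx, so older records may land in the wrong century. A minimal sketch with made-up dates:

```r
# Hypothetical two-digit-year dates on both sides of the cut-off
txt <- c("01-02-18", "01-02-68", "01-02-69")
parsed <- as.Date(txt, format = "%m-%d-%y")

# Show which century each year was assigned to
format(parsed, "%Y")
```

If the data can span that boundary, it is safer to obtain or rebuild four-digit years and parse with "%Y".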
To get three separate columns with year, month and day,
library(lubridate)
df$year <- year(ymd(df$date))
df$month <- month(ymd(df$date))
df$day <- day(ymd(df$date))
df
date year month day
1 2018-01-02 2018 1 2
2 2018-02-20 2018 2 20
3 2018-03-23 2018 3 23
Hope this helps.
Hi Gavin: another way [using your idea] is:
The data-frame we will use is oilstocks which contains a variety of variables related to the changes over time of the oil and gas stocks.
The variables are:
colnames(stocks)
"bpV" "bpO" "bpC" "bpMN" "bpMX" "emdate" "emV" "emO" "emC"
"emMN" "emMN.1" "chdate" "chV" "cbO" "chC" "chMN" "chMX"
One of the first things to do is change the emdate field, which is an integer vector, into a date vector.
realdate<-as.Date(emdate,format="%m/%d/%Y")
Next we want to split emdate column into three separate columns representing month, day and year using the idea supplied by you.
> dfdate <- data.frame(date=realdate)
year=as.numeric (format(realdate,"%Y"))
month=as.numeric (format(realdate,"%m"))
day=as.numeric (format(realdate,"%d"))
ls() will include the individual vectors, day, month, year and dfdate.
Now merge the dfdate, day, month, year into the original data-frame [stocks].
ostocks<-cbind(dfdate,day,month,year,stocks)
colnames(ostocks)
"date" "day" "month" "year" "bpV" "bpO" "bpC" "bpMN" "bpMX" "emdate" "emV" "emO" "emC" "emMN" "emMX" "chdate" "chV"
"cbO" "chC" "chMN" "chMX"
Similar results and I also have date, day, month, year as separate vectors outside of the df.
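Since the stocks data itself is not shown, here is a self-contained sketch of the same steps on a small made-up data frame. The emdate values and the single emV column are hypothetical, and emdate is assumed to be stored as m/d/Y text.

```r
# Hypothetical stand-in for the stocks data frame
stocks <- data.frame(
  emdate = c("01/02/2010", "02/03/2010", "09/10/2010"),
  emV    = c(100, 200, 300),
  stringsAsFactors = FALSE
)

# Parse the text dates, then pull out the components with format()
realdate <- as.Date(stocks$emdate, format = "%m/%d/%Y")
parts <- data.frame(
  date  = realdate,
  day   = as.numeric(format(realdate, "%d")),
  month = as.numeric(format(realdate, "%m")),
  year  = as.numeric(format(realdate, "%Y"))
)

# Put the new columns in front of the original ones
ostocks <- cbind(parts, stocks)
colnames(ostocks)
```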

Resources