Sorting and reordering in R

I have a dataframe consisting of 96321 observations of 11 variables. The data is confidential, so I am not able to share it, although I am sharing some screenshots of it.
My focus is on the FY and OM variables.
levels(mydata$FY)
[1] "2010/11" "2011/12" "2012/13" "2013/14" "2014/15" "2015/16"
levels(mydata$OM)
[1] "Apr" "Aug" "Dec" "Feb" "Jan" "Jul" "Jun" "Mar" "May" "Nov" "Oct" "Sep"
I just want to rearrange the levels of the 'OM' variable so that the year runs from April to March (the financial year).
I used the following command to rearrange the levels of my 'OM' variable:
table(is.na(mydata$OM))
FALSE
96321
levels(mydata$OM) <- c('Apr','May','Jun','July','Aug','Sep','Oct','Nov','Dec','Jan','Feb','Mar')
table(is.na(mydata$OM)) #NO NA is introduced
FALSE
96321
levels(mydata$OM)
[1] "Apr" "May" "Jun" "July" "Aug" "Sep" "Oct" "Nov" "Dec" "Jan" "Feb" "Mar"
I got the levels I expected, but when I tried to sort my data by the 'OM' variable using SQL (sqldf), I did not get the desired result.
sortedData <-sqldf('SELECT * FROM mydata
ORDER BY OM ASC')
I expected the result in increasing order of the levels of the 'OM' variable: Apr first, then May, and so on, with Mar last. But the order is somewhat distorted. Please help me with this.
Note: I also tried:
mydata$OM <- factor(mydata$OM, levels = c('Apr','May','Jun','July','Aug','Sep','Oct','Nov','Dec','Jan','Feb','Mar'))
mydata$OM <- factor(mydata$OM,
                    levels = c('Apr','May','Jun','July','Aug','Sep','Oct','Nov','Dec','Jan','Feb','Mar'),
                    labels = c('Apr','May','Jun','July','Aug','Sep','Oct','Nov','Dec','Jan','Feb','Mar'))
But both of these introduced NAs in the result:
table(is.na(mydata$OM))
FALSE TRUE
88097 8224

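Reorder the levels with factor() rather than assigning to levels(), which only renames the existing levels in place. Each level must be spelled exactly as it appears in the data ('Jul', not 'July'), otherwise the unmatched rows become NA. Starting from the original, unmodified factor:
mydata$OM <- factor(mydata$OM, levels = c('Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec','Jan','Feb','Mar'))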

Use order(), which sorts a factor by its level order:
mydata[order(mydata$OM),]
To sort by several columns, use:
mydata[order(mydata$OM,mydata$FY),]
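A minimal sketch of the difference between renaming and reordering levels, using an invented four-row stand-in for the confidential data (the values and the names relabelled, reordered, misspelt and fin_year are assumptions for illustration). A likely reason the sqldf query ignores the level order is that the column reaches SQLite as plain text, so ORDER BY sorts it alphabetically; that part is not shown here.

```r
# Toy stand-in for the confidential data; the values are invented.
mydata <- data.frame(
  FY = factor(c("2011/12", "2010/11", "2010/11", "2011/12")),
  OM = factor(c("Mar", "Apr", "Jul", "Dec"))
)
levels(mydata$OM)  # alphabetical: "Apr" "Dec" "Jul" "Mar"

# levels<- renames the existing levels by position, so the data itself changes.
relabelled <- mydata$OM
levels(relabelled) <- c("Apr", "May", "Jun", "Jul")
as.character(relabelled)  # "Jul" "Apr" "Jun" "May"

# factor(levels = ...) reorders the levels and leaves the values alone.
fin_year <- c("Apr", "May", "Jun", "Jul", "Aug", "Sep",
              "Oct", "Nov", "Dec", "Jan", "Feb", "Mar")
reordered <- factor(mydata$OM, levels = fin_year)

# A level name that does not occur in the data ('July') turns those rows into NA.
misspelt <- factor(mydata$OM, levels = sub("Jul", "July", fin_year))
sum(is.na(misspelt))  # 1

# order() follows the level order; FY breaks ties.
sorted <- mydata[order(reordered, mydata$FY), ]
as.character(sorted$OM)  # "Apr" "Jul" "Dec" "Mar"
```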

Related

Remove extra 0 in front of numeric month

I have a df with a column which has dates stored in character format, for which I want to extract the months. For this I use the following:
mutate(
  Date = as.Date(str_remove(Timestamp, "_.*")),
  Month = month(Date, label = FALSE)
)
However, October, November and December are stored with an extra zero in front of the month number, which lubridate doesn't recognise. How can I adjust the code above to fix this? This is my Timestamp column:
c("2021-010-01_00h39m", "2021-010-01_01h53m", "2021-010-01_02h36m",
"2021-010-01_10h32m", "2021-010-01_10h34m", "2021-010-01_14h27m"
)
First convert the values to dates, then use format() to get the month from them.
format(as.Date(x, '%Y-0%m-%d'), '%b')
#[1] "Oct" "Oct" "Oct" "Oct" "Oct" "Oct"
%b gives the abbreviated month name; you may also use %B (full name) or %m (month number), depending on what you need.
format(as.Date(x, '%Y-0%m-%d'), '%B')
#[1] "October" "October" "October" "October" "October" "October"
format(as.Date(x, '%Y-0%m-%d'), '%m')
#[1] "10" "10" "10" "10" "10" "10"
Another way would be to use strsplit to extract the second element:
month.abb[readr::parse_number(sapply(strsplit(x, split = '-'), "[[", 2))]
which will return:
#"Oct" "Oct" "Oct" "Oct" "Oct" "Oct"
data:
c("2021-010-01_00h39m", "2021-010-01_01h53m", "2021-010-01_02h36m",
"2021-010-01_10h32m", "2021-010-01_10h34m", "2021-010-01_14h27m"
) -> x
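As a further option, here is a small sketch that repairs the string itself before parsing, so the rest of the pipeline can stay unchanged. It assumes the stray zero only ever appears in front of a two-digit month; the second timestamp is made up to show that ordinary values pass through untouched, and the names fixed and dates are illustrative.

```r
# Second value is a made-up ordinary timestamp to show it is left alone.
x <- c("2021-010-01_00h39m", "2021-09-15_01h53m")

# Drop the stray zero only when it precedes a two-digit month.
fixed <- sub("^([0-9]{4})-0([0-9]{2})-", "\\1-\\2-", x)

# Strip the time part, then parse as usual.
dates <- as.Date(sub("_.*", "", fixed))
as.integer(format(dates, "%m"))  # 10 9
```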

Extract the months that fall in a lubridate interval

Given a lubridate interval, for example:
start <- "2016-09-24"
finish <- "2016-11-02"
my_interval <- lubridate::interval(start, finish)
> my_interval
[1] 2016-09-24 UTC--2016-11-02 UTC
I would like to be able to extract the months that fall into this interval, in this case:
[1] "Sep" "Oct" "Nov"
So far, my best attempt at this is really clunky:
my_months <- list(
"Aug" = interval("2016-08-01", "2016-08-31"),
"Sep" = interval("2016-09-01", "2016-09-30"),
"Oct" = interval("2016-10-01", "2016-10-31"),
"Nov" = interval("2016-11-01", "2016-11-30"),
"Dec" = interval("2016-12-01", "2016-12-31")
)
extract_months <- function(x, months) {
  out <- vector(mode = "character")
  for (i in seq_along(months)) {
    in_month <- int_overlaps(x, months[[i]])
    if (in_month) {
      out[i] <- names(months)[i]
    }
    out <- out[!is.na(out)]
  }
  out
}
> extract_months(x = my_interval, months = my_months)
[1] "Sep" "Oct" "Nov"
Over many years this quickly becomes unwieldy. I'm hoping somebody has a better solution.
I fail to see how this question is a duplicate of "Subset a dataframe between 2 dates".
It's actually very simple!
library(lubridate)
month.abb[month(start):month(finish)]
Let me know if this doesn't work.
The problem with @Kim's solution is that it no longer works if the interval crosses a year boundary:
library(lubridate)
# works:
month.abb[month("2016-09-24"):month("2016-11-02")]
[1] "Sep" "Oct" "Nov"
# wrong (should be Sep, Oct, Nov, Dec, Jan):
month.abb[month("2016-09-24"):month("2017-01-02")]
[1] "Sep" "Aug" "Jul" "Jun" "May" "Apr" "Mar" "Feb" "Jan"
One solution could be:
# correct:
month.abb[unique(month(seq.Date(from = as.Date("2016-09-24"), to = as.Date("2017-01-02"), by = "day")))]
[1] "Sep" "Oct" "Nov" "Dec" "Jan"
One step further might include the year:
library(lubridate)
# the next 12 months starting on the first of next month
my_interval <- interval(ceiling_date(Sys.Date(), unit = "month"),
                        ceiling_date(Sys.Date(), unit = "month") + years(1) - days(1))
month_starts <- seq.Date(from = date(int_start(my_interval)), to = date(int_end(my_interval)), by = "month")
year_month_vec <- paste0(year(month_starts), "-", month.abb[month(month_starts)])
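For comparison, a base R sketch that steps through the range one month at a time instead of one day at a time, which also copes with intervals that cross a year boundary. The helper name months_in_range is hypothetical, and it takes the start and end dates directly rather than a lubridate interval.

```r
# Hypothetical helper: abbreviated names of the months touched by [start, finish].
months_in_range <- function(start, finish) {
  start <- as.Date(start)
  finish <- as.Date(finish)
  # Snap the start to the first of its month so every step lands on a month start.
  first <- as.Date(format(start, "%Y-%m-01"))
  month_starts <- seq(first, finish, by = "month")
  month.abb[as.integer(format(month_starts, "%m"))]
}

months_in_range("2016-09-24", "2016-11-02")  # "Sep" "Oct" "Nov"
months_in_range("2016-09-24", "2017-01-02")  # "Sep" "Oct" "Nov" "Dec" "Jan"
```

For ranges longer than a year the month names repeat, so pairing them with the year, as in the answer above, may be the safer choice.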

Parsing/formatting odd date formats with lubridate

I am having some trouble formatting the following date with lubridate. I'm not married to the lubridate approach but can someone recommend a good way to format these wonky Sept dates?
library(lubridate)
df <- data.frame(y=1:5, Date=c("Sept 1 2002","Sept 7 2002","Sept 9 2002","Sept 20 2002","Sept 21 2002"))
I didn't really expect this to work:
df$Date2=mdy(df$Date)
But I do not understand why this one didn't work:
df$Date2=parse_date_time(df$Date, "%b %d %Y")
Any ideas?
It will work if the month abbreviations match those in month.abb. One option would be to remove the 't' in 'Sept' using sub (the pattern keeps the first three characters and drops the fourth).
mdy(sub('(...).', '\\1', df$Date))
#[1] "2002-09-01 UTC" "2002-09-07 UTC" "2002-09-09 UTC" "2002-09-20 UTC" "2002-09-21 UTC"
and
month.abb
#[1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
If we look at ?strptime
%b: Abbreviated month name in the current locale on this platform.
(Also matches full name on input: in some locales there are no
abbreviations of names.)
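Since %b depends on the current locale, a locale-independent sketch might build the date from its parts instead, matching only the first three letters of the month against month.abb. This assumes every value has the shape "Month day year" separated by single spaces; the variable names are illustrative.

```r
dates <- c("Sept 1 2002", "Sept 7 2002", "Sept 21 2002")

# Split each value into month text, day and year.
parts <- strsplit(dates, " ", fixed = TRUE)
mon_txt <- vapply(parts, `[`, character(1), 1)
day <- as.integer(vapply(parts, `[`, character(1), 2))
yr <- as.integer(vapply(parts, `[`, character(1), 3))

# Match on the first three letters so "Sept" and "Sep" both map to 9.
mon <- match(substr(mon_txt, 1, 3), month.abb)

parsed <- as.Date(sprintf("%04d-%02d-%02d", yr, mon, day))
format(parsed)  # "2002-09-01" "2002-09-07" "2002-09-21"
```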

Function for getting one instance of each element in a vector in R

I would like to know which elements occur in a vector that has a lot of clones. Please, before you suggest using levels(), let me explain first.
So, for example:
data <-c( "Jan", "Jan", "Feb", "Feb", "Feb", "Mar" )
supermagicfunction( data )
[1] "Jan" "Feb" "Mar"
As you see, I'm working with dates. I'm using POSIX (actually strftime()) for that. This is where the problem is. Normally, I would use levels. But that returns all months of the year as levels because I work with POSIX dates. Like this:
levels( data )
[1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
I assume POSIXct kindly determines the levels for this vector.
Now, my question is: Does anyone know a function (perhaps even a primitive?) that could help here?
Ha! I just found it myself. This will work:
unique( data )
[1] "Jan" "Feb" "Mar"
And it's fast, too.
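A short sketch of why levels() and unique() disagree, assuming the real vector is a factor that carries all twelve month levels (the example vector in the question is plain character, so f here is a reconstruction). droplevels() is shown as an alternative when the result should remain a factor.

```r
# Assumed shape of the real data: a factor with all twelve months as levels.
f <- factor(c("Jan", "Jan", "Feb", "Feb", "Feb", "Mar"), levels = month.abb)

length(levels(f))        # 12: levels() lists every possible value
as.character(unique(f))  # "Jan" "Feb" "Mar": only the values that occur
levels(droplevels(f))    # "Jan" "Feb" "Mar": unused levels removed
```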

How to subset a vector, based on the condition "contains" a character?

If I have a vector:
Months = month.abb[1:12]
I want to extract all the months that start with Letter J (in this case, Jan, Jun, and Jul).
Is there a wildcard character, like * in Excel, so that searching for J* lists all matching elements of the vector?
How do I extract elements that start with either the letter 'M' or 'A'? The expected output would be Mar, May, Apr, Aug.
Try:
grep("^J", Months,value=TRUE)
#[1] "Jan" "Jun" "Jul"
grep("^A|^M", Months,value=TRUE)
#[1] "Mar" "Apr" "May" "Aug"
You'll find the glob2rx function helpful for converting wildcard constructions to regular expressions:
> glob2rx("J*")
[1] "^J"
> grep(glob2rx("J*"), Months, value=TRUE)
[1] "Jan" "Jun" "Jul"
If you happen to have stringr loaded you could do:
library(stringr)
str_subset(Months, "^J")
[1] "Jan" "Jun" "Jul"
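If no regular expression is needed, a sketch with base R's startsWith() and substr() does the same prefix test. Both results come back in the vector's own order, so the 'M' or 'A' case gives Mar, Apr, May, Aug.

```r
Months <- month.abb

# Literal prefix test, no regex involved.
Months[startsWith(Months, "J")]  # "Jan" "Jun" "Jul"

# First letter is one of several candidates.
Months[substr(Months, 1, 1) %in% c("M", "A")]  # "Mar" "Apr" "May" "Aug"
```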
