sorting months in R with count or summarise

I am trying to sort months with R and have the following:
```{r}
result <- mydata %>%
count(months(as.Date(orderdate)))
result
```
This gives me the months and a count of the orders. However, the months are not ordered correctly by month. How can I sort this correctly by month?
I already tried using "order" and "factor", but that did not work correctly. How can I keep the code short and still get the correct order?
Thanks,
Roland

Do:
count(
  factor(months(as.Date(orderdate)), month.name)
)
Since months() "return[s] a character vector of [month] names in the locale in use", this assumes an English locale; passing the built-in month.name constant as the levels is less brittle than typing the U.S./English month names out by hand.
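For example, dropped into the original pipe it might look like this (a sketch, assuming an English locale so that the names produced by months() match month.name):
```r
library(dplyr)

result <- mydata %>%
  count(month = factor(months(as.Date(orderdate)), levels = month.name))
result
```
count() then orders its output by the factor's levels, i.e. in calendar order.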

Since no sample data is provided, I can't test whether this really works in your specific case. Converting your months data to a factor works, but you'll have to specify the names of the months first (in the language of your data), as R doesn't know how you want them ordered. Creating a factor without defining the levels will only lead to alphabetical order, which isn't correct for months.
library(dplyr)

result <- mydata %>%
  mutate(ordered_months = factor(months(as.Date(orderdate)),
                                 levels = c("January", "February", ...))) %>% # insert all month names here
  count(ordered_months)
result
...should work.

The following assumes an English locale, because the built-in variable month.name contains English month names.
First of all, make up a dataset, since you have not posted one.
set.seed(1)
d <- seq(as.Date("2017-01-01"), Sys.Date(), by = "month")
mydata <- data.frame(orderdate = sample(d, 1e2, TRUE))
Now the problem. Note that applying order() to a permutation yields its inverse permutation, a fact this answer relies on.
library(dplyr)
library(lubridate)

result <- mydata %>%
  count(months(as.Date(orderdate)))

inx <- order(month.name)
result[order(inx), ]
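As a small illustration of the order()-of-order() trick (a sketch, not part of the original answer): order(month.name) records which calendar month sits in each alphabetical slot, and order() of that permutation is its inverse, which maps the alphabetically sorted rows back to calendar order.
```r
alpha_months <- sort(month.name)   # what count() effectively returns: alphabetical order
inx <- order(month.name)           # calendar index of each alphabetically sorted month
alpha_months[order(inx)]           # "January" "February" ... "December"
```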

Related

Why am I getting an 'Error in UseMethod("arrange")' error in R?

I'm writing an R program which lists sales prices for various items. I have a column called InvoiceDate, which lists date and time as follows: '12/1/2009 7:45'. I'm trying to isolate the date only in a separate field called date, and then arrange the dates sequentially. The code I'm using is as follows:
library(dplyr)
library(ggplot2)
setwd("C:/Users/cshor/OneDrive/Environment/Restoration_Ecology/Udemy/Stat_Thinking_&_Data_Sci_with_R/Assignments/Sect_5")
retail_clean <- read.csv("C:/Users/cshor/OneDrive/Environment/Restoration_Ecology/Udemy/Stat_Thinking_&_Data_Sci_with_R/Data/retail_clean.csv")
retail_clean$date <- as.Date(retail_clean$InvoiceDate)#, format = "%d/%m/%Y")
total_sales = sum(retail_clean$Quantity, na.rm=TRUE) %>%
arrange(retail_clean$date) %>% ggplot(aes(x=date, y=total_sales)) + geom_line()
Initially, everything works fine, and the date field is created. However, I get the following error for the arrange() function:
Error in UseMethod("arrange") : no applicable method for 'arrange' applied to an object of class "c('integer', 'numeric')"
I've searched for over a week for a solution to this problem, but have found nothing that specifically addresses this issue. I've also used as.POSIXct() instead of as.Date(), with similar results. Any help as to why the program interprets the date data as numeric, and how I can correct the problem, would be greatly appreciated.
First, the error message is not about dates at all.
Let's look at the code you provided:
total_sales = sum(retail_clean$Quantity, na.rm=TRUE) %>%
arrange(retail_clean$date) %>% ggplot(aes(x=date, y=total_sales)) + geom_line()
The result of the term sum(retail_clean$Quantity, na.rm=TRUE) is an integer in your case, and it is piped into the first argument of the dplyr::arrange function, which calls UseMethod("arrange").
The piped argument is then found to be an object of class integer/numeric, and arrange does not have a method for those classes; that is, neither arrange.integer nor arrange.numeric is defined. Hence the error message. There is nothing wrong with your date conversion, except that you do need the format argument you commented out in the code sample.
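A minimal way to reproduce the same error (a sketch, not from the original post):
```r
library(dplyr)

# the piped value is a plain integer, not a data frame,
# so there is no arrange() method for it:
sum(1:10) %>% arrange()
# Error in UseMethod("arrange") : no applicable method for 'arrange'
#   applied to an object of class "c('integer', 'numeric')"
```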
The solution is also simple. Change sum to something that returns a data.frame or other classes that arrange is aware of. You can check what methods are available for arrange:
> methods(dplyr::arrange)
[1] arrange.data.frame*
In this R session, you can only put a data.frame object through arrange, though you can always define arrange methods for other classes.
It looks like this is a Udemy course assignment. Maybe you need to calculate a sum for each day or each month, whichever your assignment asks for, but a single overall sum is definitely not the right answer.
By the way, welcome to SO!
Update:
An example
n <- 100
data <- data.frame(sales = runif(n), day = sample(1:30, n, replace = TRUE))
data$date_ <- paste0(data$day, "/1/2009 7:45")
head(data$date_) # this is the original date string
data$date <- as.Date(data$date_, format = "%d/%m/%Y")
head(data$date)  # check here to see the formatted date

library(dplyr)
library(ggplot2)

data %>%
  group_by(date) %>%
  summarise(totalSale = sum(sales, na.rm = TRUE)) %>%
  arrange(date) %>%
  ggplot(aes(x = date, y = totalSale)) +
  geom_line()
Here is the plot (a line chart of totalSale against date). It looks fine, doesn't it? The sales are all ordered by date now.

R- How do I use a lookup table containing threshold values that vary for different variables (columns) to replace values below those thresholds?

I am trying to streamline the process of auditing chemistry laboratory data. When we encounter data where an analyte is not detected I need to change the recorded result to a value equal to 1/2 of the level of detection (LOD) for the analytical method. I have LOD's contained within another dataframe to be used as a lookup table.
I have multiple columns representing data from different analytical tests, each with its own unique LOD. Here's an example of the type of data I am working with:
library(tidyverse)

dat <- tibble("Lab_ID" = as.character(seq(1, 10, 1)),
              "Tributary" = c('sawmill', 'paint', 'herring', 'water',
                              'paint', 'sawmill', 'bolt', 'water',
                              'herring', 'sawmill'),
              "date" = rep(as.POSIXct("2021-10-01 12:00:00"), 10),
              "TP"  = c(1.5, 15.7, -2.3, 7.6, 0.1, 45.6, 12.2, -0.1, 22.2, 0.6),
              "TN"  = c(100.3, 56.2, -10.5, 0.4, -0.3, 11.0, 45.8, 256.0, 12.2, 144.0),
              "DOC" = c(56.0, 120.3, -10.5, 0.2, 14.6, 489.3, 0.3, 14.4, 54.6, 88.8))
dat

detect_level <- tibble("Parameter" = c('TP', 'TN', 'DOC'),
                       "LOD" = c(0.6, 11, 0.3)) %>%
  mutate(halfLOD = LOD / 2)
detect_level
I have pored over multiple other questions with a similar theme:
Change values in multiple columns of a dataframe using a lookup table
R - Match values from multiple columns in a data.frame to a lookup table.
Replace values in multiple columns using different thresholds
and gotten to the point where I have pivoted the data and split it into a list of dataframes, one per analyte:
dat %>%
  pivot_longer(cols = c('TP', 'TN', 'DOC')) %>%
  arrange(name) %>%
  split(.$name)
I have tried to apply a function using map(), but I cannot figure out how to integrate the values from the lookup table (detect_level) into my code. If someone could help me continue this pipe, or finish the process another way, to achieve a final product dat2 that should look like this, I would appreciate it:
dat2 <- tibble("Lab_ID" = as.character(seq(1, 10, 1)),
               "Tributary" = c('sawmill', 'paint', 'herring', 'water',
                               'paint', 'sawmill', 'bolt', 'water',
                               'herring', 'sawmill'),
               "date" = rep(as.POSIXct("2021-10-01 12:00:00"), 10),
               "TP"  = c(1.5, 15.7, 0.3, 7.6, 0.3, 45.6, 12.2, 0.3, 22.2, 0.6),
               "TN"  = c(100.3, 56.2, 5.5, 5.5, 5.5, 11.0, 45.8, 256.0, 12.2, 144.0),
               "DOC" = c(56.0, 120.3, 0.15, 0.15, 14.6, 489.3, 0.3, 14.4, 54.6, 88.8))
dat2
Another possibility comes from the closest similar question I have found:
Lookup multiple column from a single table
Here's a snippet of code that I have adapted from that question; however, if you run it you will see that an NA is returned wherever a value is not found in detect_level. Additionally, it does not appear to have worked for $TN or $DOC, even in cases where the $LOD value from detect_level was present.
dat %>%
  mutate(across(all_of(unique(detect_level$Parameter)),
                ~ {i1 <- detect_level$Parameter == cur_column()
                   detect_level$LOD[i1][match(., detect_level$LOD)]}))
I am not at all comfortable with the purrr-style syntax here and have only adapted this code from the linked question, so if this is the direction an answerer chooses, I would appreciate comments explaining briefly what is happening "under the hood".
Thank you in advance!
Perhaps this helps
library(dplyr)

dat %>%
  mutate(across(all_of(detect_level$Parameter),
                ~ pmax(., detect_level$LOD[match(cur_column(), detect_level$Parameter)])))
For the updated case
dat %>%
  mutate(across(all_of(detect_level$Parameter),
                ~ replace(., . < detect_level$LOD[match(cur_column(), detect_level$Parameter)],
                          detect_level$halfLOD[match(cur_column(), detect_level$Parameter)])))
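Since the question asks for a brief explanation of what happens "under the hood", here is the same idea written out with comments (a sketch using the example data above; the intermediate variable i is introduced only for readability):
```r
library(dplyr)

dat2 <- dat %>%
  mutate(across(
    all_of(detect_level$Parameter),   # iterate over the TP, TN and DOC columns
    ~ {
      # row of the lookup table matching the column currently being processed
      i <- match(cur_column(), detect_level$Parameter)
      # wherever a value is below that parameter's LOD, substitute half the LOD
      replace(.x, .x < detect_level$LOD[i], detect_level$halfLOD[i])
    }
  ))
dat2
```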

dplyr passing column names as a variable with is.na filter

I am aware that similar questions have been asked and I have tried multiple options, but I am still getting an error message.
df_construction <- function(selected_month, selected_variable){
  selected_variable_en <- rlang::enquo(selected_variable) # This was an attempt following the link
  # filter_criteria <- interp(!is.na(~y), .values = list(y = as.name(selected_variable))) # This doesn't work
  df1 <- airquality %>%
    dplyr::filter(Month == selected_month,
                  !is.na(selected_variable_en)) %>%
    select(Month, Day, !!selected_variable)
  return(df1)
}
df1 <- df_construction(2, "Solar.R")
My ultimate goal is to build this in Shiny and thus have inputs the user will have selected as arguments in the function.
I know that the filter and the select functions shouldn't be dealt with in the same way.
I have followed the steps according to: https://www.brodrigues.co/blog/2016-07-18-data-frame-columns-as-arguments-to-dplyr-functions/ but had no success due to the !is.na filter.
I just want to have a dataframe where the only columns are the Month column for the selected months, the Day column and whichever column from the choice Ozone, Solar.R, Wind, Temp the user has selected, without any NA.
Thank you very much for your help!!
!! alone is often not enough to unquote variable names; you often need it in conjunction with rlang::sym. And if you have more than one variable to unquote, you need !!! together with rlang::syms.
df_construction <- function(selected_month, selected_variable){
  df1 <- airquality %>%
    dplyr::filter(Month == selected_month,
                  !is.na(!!rlang::sym(selected_variable))) %>%
    select(Month, Day, selected_variable)
  return(df1)
}
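For example (a sketch; airquality only covers May to September, so Month should be one of 5-9, and newer dplyr versions may suggest wrapping selected_variable in all_of() inside select()):
```r
library(dplyr)

df1 <- df_construction(5, "Solar.R")
head(df1)   # Month, Day and Solar.R columns, with rows where Solar.R is NA removed
```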
For select(), you can put variable names in directly. There is newer functionality in dplyr for unquoting, the {{ }} (curly-curly) operator, but it does not work in all cases.
If you start passing variable names into functions, you might run into difficulties with dplyr. In that respect, data.table is easier to use (see a blog post I wrote on the subject).

Sorting formatted dates in R

I have a dataframe of dates and numeric values in R. The dates are all the first of the month and the values are a number associated with that month.
library(DT)
library(dplyr)

df <- data.frame(date = as.Date(c("2017-01-01", "2017-02-01", "2017-03-01", "2017-04-01")),
                 val = c(-5600, 7000, 4200, -2000))
I'd like to stick this through DT::datatable(), which is my new favourite thing. However, I'd like to have the output formatted nicely: thousands separators, nice dates, etc.
df <- df %>% mutate(val = formatC(val, big.mark=","))
datatable(df)
This turns val into a character vector, although datatable() is apparently able to recognise that it's really a number and sort appropriately using the arrows in the header. So far so good.
However the issue comes when I try to format the date as MMM YY.
df <- df %>% mutate(date = format(date, "%b %y"))
datatable(df)
This turns date into a character vector as well - the values look like "Jan 17" etc. Everything looks fine; the only trouble is that when I go to sort by date, it doesn't recognise the values as months and puts them in alphabetical rather than chronological order.
Is there any way of reformatting the dates, either prior to or whilst passing them to datatable(), to keep the "date-ness" of the variable and allow it to be sorted appropriately? Failing that, is there another package that outputs interactive tables and is better at sorting?
Thanks in advance,
James
You can use the lubridate package and do this with the function below.
What you need to do is take the month and the year into account separately.
library(lubridate)

date_conversion <- function(df){
  months <- month(df$date, label = TRUE)
  years <- year(df$date)
  months_years <- paste(months, years, sep = " ")
  df[1] <- months_years
  df[order(row.names(df), decreasing = FALSE), ]
}
hope this helps you .... :)
DataTables as integrated in R by the DT package has options to format numeric and date variables while maintaining the proper sort order.
Below, I will discuss three different options:
library(DT)

df <- data.frame(date = as.Date(c("2017-01-01", "2017-02-01", "2017-03-01", "2017-04-01")),
                 val = c(-5600, 7000, 4200, -12000))
Please note that I've deliberately chosen to change the last value in column val to demonstrate a pitfall in using formatC().
# OP's own formatting
df$val_chr <- formatC(df$val, big.mark = ",")
df$date_chr <- format(df$date, "%b %y")

# copy columns to demonstrate DT formatting
df$val_dt <- df$val
df$date_dt <- df$date

# ISO 8601 year-month format as alternative
df$date_iso <- format(df$date, "%Y-%m")

# create DT object and apply DT formatting
datatable(df) %>%
  formatCurrency("val_dt", "") %>%
  formatDate("date_dt", "toDateString")
Note that val_dt has been formatted nicely as expected and is right-justified. In contrast, val_chr is left-justified with the thousands separators not aligned. In addition, formatC() has recognised that val is of type double and has used the "g" format by default; according to the description of the format parameter in ?formatC, the default is "d" for integers and "g" for reals. So we do get
formatC(12000L, big.mark=",")
#[1] "12,000"
but
formatC(12000, big.mark=",")
#[1] "1.2e+04"
Sorting by date_dt within the DataTables widget, by clicking on the small arrow symbols at the right side of the column headers, works as expected, in contrast to date_chr. Unfortunately, the number of available methods for formatDate() is limited and doesn't include the desired month-year format. (There is a datetime plugin which converts date/time source data into a form suitable for display, but I haven't explored that in detail.)
Column date_iso shows the abbreviated ISO 8601 format YYYY-MM as a third option. This is my favoured format (which I also use a lot for aggregating by month) because
it always sorts correctly, even for several years,
it doesn't depend on the current locale, so it works in any language,
it is short while being unambiguous,
and it is an international standard.
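As a quick illustration of the first point, ISO year-month strings sort correctly even as plain character values (a sketch, not part of the original answer):
```r
sort(c("2017-11", "2016-02", "2017-01"))
#[1] "2016-02" "2017-01" "2017-11"
```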
Addendum
The formattable package also has various formatter functions and can create DataTables:
library(formattable)
as.datatable(formattable(df))

how to use gather_ in tidyr with variables

I'm using tidyr together with shiny and hence need to use dynamic values in tidyr operations.
However, I have trouble using gather_(), which I think was designed for such cases.
Minimal example below:
library(tidyr)
df <- data.frame(name = letters[1:5], v1 = 1:5, v2 = 10:14, v3 = 7:11,
                 stringsAsFactors = FALSE)

# works fine
df %>% gather(Measure, Qty, v1:v3)

dyn_1 <- 'Measure'
dyn_2 <- 'Qty'
dyn_err <- 'v1:v3'
dyn_err_1 <- 'v1'
dyn_err_2 <- 'v2'

# error
df %>% gather_(dyn_1, dyn_2, dyn_err)
# error
df %>% gather_(dyn_1, dyn_2, dyn_err_1:dyn_err_2)
After some debugging I realised the error happens at the melt measure.vars part, but I don't know how to get it to work with the ':' there...
Please help with a solution and explain it a little so I can learn more.
You are telling gather_ to look for a column named 'v1:v3', not the separate column ids. Simply change dyn_err <- "v1:v3" to dyn_err <- paste("v", seq(3), sep = "").
If your df has different column names (e.g. var_a, qtr_b, stg_c), you can either extract those column names or use paste() for whichever variables are of interest.
dyn_err <- colnames(df)[2:4]
or
dyn_err <- paste(c("var", "qtr", "stg"), letters[1:3], sep="_")
You need to look at which column names you want and build the corresponding character vector, as in the sketch below.
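Putting it together with the example data from the question (a sketch; note that the underscore verbs such as gather_() are deprecated in current tidyr in favour of tidy evaluation, but they still work):
```r
library(tidyr)

df <- data.frame(name = letters[1:5], v1 = 1:5, v2 = 10:14, v3 = 7:11,
                 stringsAsFactors = FALSE)

dyn_1   <- 'Measure'
dyn_2   <- 'Qty'
dyn_err <- paste("v", seq(3), sep = "")  # "v1" "v2" "v3"

# same result as df %>% gather(Measure, Qty, v1:v3)
df %>% gather_(dyn_1, dyn_2, dyn_err)
```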
