ggplot sort order treatment of NA values - r

My goal is to create a scatter plot of requests for service.
The X axis will be the date the request was made.
X values will show dates from oldest to newest, left to right.
The Y axis will show the priority assigned to the request.
I wish to order the Y values from highest priority at the top (i.e., 1) to lowest.
Requests which haven't been prioritized have NA in that column.
Here is a sample data set (NOTE: the original data file is tab-separated, with no values in the positions where "NA" is shown below for clarity's sake):
ID Priority DateCreated
549 NA 2018-02-15
548 NA 2018-02-15
547 3 2018-02-13
537 1 2018-01-17
536 5 2018-01-17
518 NA 2017-12-21
509 3 2017-11-27
500 2 2017-11-16
486 NA 2017-10-04
477 3 2017-08-08
475 1 2017-09-14
448 2 2017-07-21
444 5 2017-07-14
431 5 2017-06-30
425 1 2017-06-21
407 2 2017-05-26
395 4 2017-05-09
394 4 2017-05-09
374 4 2017-04-27
368 2 2017-04-21
352 NA 2017-04-03
328 4 2017-02-28
308 NA 2017-02-28
272 2 2016-10-05
213 4 2016-05-19
212 5 2016-05-19
200 2 2016-04-26
188 NA 2016-03-17
After loading ggplot2 and data.table, I create the plot with this code:
bl <- fread("backlog.txt")
bl$DateCreated <- as.Date(bl$DateCreated, "%Y-%m-%d")
bl$Priority <- as.integer(bl$Priority)
ggplot(bl, aes(x = DateCreated, y = reorder(Priority, -Priority))) +
geom_text((aes(label = ID)))
If you reproduce this plot, you will see that the items with a priority of NA appear at the top. For presentation to my customer, it is much clearer if they appear at the bottom.
I suppose I could replace the NAs with a "magic number" (e.g., 11), but I'd prefer a less kludgey solution.
Anyone dealt with a similar issue already?
Thanks.

This is a bit of a workaround as well, but I think it is more acceptable than setting a 'magic number':
bl$DateCreated <- as.Date(bl$DateCreated, "%Y-%m-%d")
bl$Priority[is.na(bl$Priority)] <- "No Data Available"
bl$Priority <- factor(bl$Priority,levels=c("No Data Available","1","2","3","4","5"))
ggplot(bl, aes(x = DateCreated, y = Priority)) + geom_text((aes(label = ID)))
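A note on level order, offered as a hedged sketch rather than as part of the answer above: ggplot2 draws the first level of a discrete y axis at the bottom, so listing the placeholder first and the priorities in descending order should put the unprioritized requests at the bottom and priority 1 at the top. The label "Unprioritized" and the toy vector below are assumptions for illustration only.

```r
# Toy priorities; NA marks a request that has not been prioritized (illustrative data)
priority <- c(NA, 3L, 1L, 5L, 2L, 4L, NA)

# Replace NA with an explicit label (the label text is an arbitrary choice)
label <- ifelse(is.na(priority), "Unprioritized", as.character(priority))

# The first level sits lowest on a discrete axis, so list the placeholder first,
# then the real priorities from the largest number down to 1
known <- sort(unique(priority[!is.na(priority)]), decreasing = TRUE)
priority_f <- factor(label, levels = c("Unprioritized", as.character(known)))

levels(priority_f)
```

Using a factor built this way as the y aesthetic, in place of reorder(Priority, -Priority), should then give the layout asked for in the question.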

Related

Rounding month end data using weekly split in R

Input table:
Date Qty
2017-01-01 234
2017-01-08 123
2017-01-15 445
2017-01-22 113
2017-01-29 674
2018-02-05 120
2018-02-12 921
2018-02-19 732
2018-02-26 634
2018-03-05 711
Expected table:
Date Qty
2017-01-01 234
2017-01-08 123
2017-01-15 445
2017-01-22 113
2017-01-29 708.28 #674+(120/7 * 2)
2018-02-05 85.71 #(120/7 * 5)
2018-02-12 921
2018-02-19 732
2018-02-26 837.14 #634+(711/7 * 2)
2018-03-05 507.85 #(711/7 * 5)
In the expected table above, part of the quantity on the first date of a new month is moved back to the last date of the previous month, using daily shares of the week.
E.g.:
2018-02-26 had a quantity of 634 and 2018-03-05 had 711.
So the quantity 711 is split by 7 (the number of days in a week), i.e. 711/7 = 101.571. February 2018 has 28 days, so 2 shares of 101.571 (the days left in February after 2018-02-26) are added to the existing quantity of 2018-02-26, making it 634 + (101.571 * 2) = 634 + 203.14 = 837.14 (as you can see in the expected table). Those 2 shares are deducted from 2018-03-05, which keeps the remaining 5 shares (the days of that week that fall in March), i.e. 711/7 * 5 = 507.85 (as you can see in the expected table).
Using R how should I generalise this situation?
Does this answer your question?
> library(dplyr)
> library(lubridate)  # day(), wday(), days_in_month()
> first_day_of_month_wday <- function(dx) {
+ day(dx) <- 1
+ wday(dx)
+ }
> fil <- ceiling((day(df$Date) + first_day_of_month_wday(df$Date) - 1) / 7)
>
> df %>% mutate(Qty1 = case_when(fil > 4 ~ Qty + (days_in_month(df$Date) - day(Date)) * lead(Qty)/7, TRUE ~ Qty)) %>%
+ mutate(Qty1 = case_when(lag(fil) > 4 ~ Qty/7 * day(Date), TRUE ~ Qty1)) %>% select(-Qty) %>% rename(Qty = Qty1)
# A tibble: 10 x 2
Date Qty
<date> <dbl>
1 2017-01-01 234
2 2017-01-08 123
3 2017-01-15 445
4 2017-01-22 113
5 2017-01-29 708.
6 2018-02-05 85.7
7 2018-02-12 921
8 2018-02-19 732
9 2018-02-26 837.
10 2018-03-05 508.
>
PS: Used first_day_of_month_wday function from R: How to get the Week number of the month.
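To make the arithmetic behind the split concrete, here is a minimal base R sketch for the February/March pair from the question. It is not the pipeline above, only the single-pair calculation, and the variable names are made up for illustration.

```r
# One month-end row and the row that follows it (values from the question)
month_end <- as.Date("2018-02-26")
qty_end   <- 634
qty_next  <- 711

# Last calendar day of the month that month_end falls in
first_of_month <- as.Date(format(month_end, "%Y-%m-01"))
last_day <- seq(first_of_month, by = "month", length.out = 2)[2] - 1

# Days of the following week that still belong to the old month
days_left <- as.integer(last_day - month_end)

# One daily share of the next week's quantity
share <- qty_next / 7

adj_end  <- qty_end + share * days_left   # month-end row gains days_left shares
adj_next <- share * (7 - days_left)       # next row keeps the remaining shares

round(c(adj_end, adj_next), 2)
```

The two adjusted values still add up to 634 + 711, so the split only moves quantity between the two rows.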

How to transform a dataframe into time series?

I'm sorry, I know this question has been asked many times, but I'm having trouble converting my data frame into a time series.
This is my data frame (after dropping some columns):
head(New_DF)
ï..date qty
1 2017-07-05 61
2 2018-01-20 73
3 2017-07-10 145
4 2017-07-01 255
5 2017-05-23 267
6 2017-06-24 242
And this is what I did:
library(zoo)
as.ts(read.zoo(New_DF, FUN = as.yearmon))
And I get this error:
Error in seq.default(head(tt, 1), tail(tt, 1), deltat) :
'from' must be a finite number
In addition: Warning message:
In zoo(rval3, ix) :
some methods for “zoo” objects do not work if the index entries in ‘order.by’ are not unique
I think I understand why: I have a lot of duplicates in my ï..date column. Unfortunately I don't want to drop them, since time-series ML models are a bit different from other routine ML models. As a time-series model is based on the sequence of previous values, dropping a date may impact my solution.
Any suggestions would be much appreciated, thank you.
1) yearmon Assuming New_DF is as shown reproducibly in the Note at the end, use read.zoo with the argument aggregate = sum.
library(zoo)
read.zoo(New_DF, FUN = as.yearmon, aggregate = sum)
giving:
May 2017 Jun 2017 Jul 2017 Jan 2018
267 242 461 73
2) Date If you want to keep the individual rows then use Date class instead of yearmon (assuming that the dates are unique).
read.zoo(New_DF)
## 2017-05-23 2017-06-24 2017-07-01 2017-07-05 2017-07-10 2018-01-20
## 267 242 255 61 145 73
3) sequence number Another possibility is to just ignore the dates and use 1, 2, 3, ... as the index. The values then stay in their original row order.
zoo(New_DF$qty)
## 1 2 3 4 5 6
## 61 73 145 255 267 242
Note
Lines <- " ï..date qty
1 2017-07-05 61
2 2018-01-20 73
3 2017-07-10 145
4 2017-07-01 255
5 2017-05-23 267
6 2017-06-24 242 "
New_DF <- read.table(text = Lines)
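For readers without zoo at hand, the idea behind aggregate = sum in option 1 can be sketched in base R: rows that fall in the same month share one index value, so they have to be combined before a regular series can be built. This is only an illustration of the aggregation step, not what read.zoo does internally; the column is renamed to date here for readability, which is an assumption.

```r
# Same six rows as in the Note, with the date column renamed for readability
New_DF <- data.frame(
  date = as.Date(c("2017-07-05", "2018-01-20", "2017-07-10",
                   "2017-07-01", "2017-05-23", "2017-06-24")),
  qty  = c(61, 73, 145, 255, 267, 242)
)

# Year-month key: three of the rows collapse onto "2017-07"
month_key <- format(New_DF$date, "%Y-%m")
has_duplicates <- anyDuplicated(month_key) > 0

# Sum the quantities that share a month
monthly <- tapply(New_DF$qty, month_key, sum)
monthly
```

The July total here matches the 461 shown in the read.zoo output above.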
Could you share some background about your data? Also, if there are duplicates in the data, can you just sum them up so that the above error won't occur?

Draw 18 plots on one graph in R?

I have a data frame with 18 columns and I want to see the seasonally adjusted series of each variable on a single chart.
Here is the head of my data frame:
head(cityFootfall)
Istanbul Eskisehir Mersin
1 44280 12452 11024
2 58713 13032 12773
3 21235 5629 5749
4 20934 5968 5764
5 21667 6022 5752
6 21386 6281 5920
Ankara Bursa Adana Izmir
1 19073 5098 8256 15623
2 22812 7551 10631 18511
3 8777 2260 3733 8625
4 8798 2252 3536 8573
5 8893 2398 3641 9713
6 8765 2391 3618 10542
Kayseri Antalya Konya
1 8450 2969 4492
2 8378 4421 0
3 3491 1744 0
4 3414 1833 0
5 3596 1733 0
6 3481 1785 1154
Samsun Kahramanmaras Aydin
1 4472 4382 4376
2 4996 4773 5561
3 1662 1865 2012
4 1775 1710 1957
5 1700 1704 1940
6 1876 1848 1437
Gaziantep Sanliurfa Izmit
1 3951 3752 3825
2 5412 4707 4125
3 2021 1326 1890
4 1960 1411 1918
5 1737 1204 1960
6 1833 1143 2047
Denizli Malatya
1 2742 3809
2 3658 4346
3 1227 1975
4 1172 1884
5 1102 2073
6 1171 2060
Here is my function for this:
plot_seasonality=function(x){
par(mfrow=c(6,3))
plot_draw=lapply(x, function(x) plot(decompose(ts(x,freq=7),type="additive")$x-decompose(ts(x,freq=7),type="additive")$seasonal))
}
plot_seasonality(cityFootfall)
When I run this function I get an error: Error in plot.new() : figure margins too large. When I change my code from par(mfrow=c(6,3)) to par(mfrow=c(3,3)) it works and gives me plots of the last 9 columns, but I want to see all variables in a single chart.
Could anyone help me solve my problem?
Fundamentally your window is not big enough to plot that:
1) open a big window with dev.new() (or, from RStudio, X11() under Linux or quartz() under macOS)
2) simplify your ylab, which will free up space
# made up data
x <- seq(0,14,length.out=14*10)
y <- matrix(rnorm(14*10*6*3),nrow=3*6)
# large window (may use `X11()` on linux/Rstudio to force opening of new window)
dev.new(width=20,height=15)
par(mfrow=c(6,3))
# I know you could do that with `lapply` but don't listen to the fatwas
# on `for` loops it often does exactly the job you need in R
for(i in 1:dim(y)[1]){
plot(x,y[i,],xlab="Time",ylab=paste("variable",i),type="l")
}
You should also consider plotting several variables in the same graph (using lines after an initial plot).
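Besides a larger device, shrinking the per-panel margins with par(mar = ...) is a common way around "figure margins too large". The sketch below combines both ideas on made-up data: 18 columns of 28 daily values, written to a temporary PDF so that it runs without a screen device. The data, the margin values and the file name are all assumptions for illustration.

```r
# Made-up stand-in for cityFootfall: 18 columns, 4 weeks of daily counts
set.seed(1)
toy <- as.data.frame(matrix(rpois(18 * 28, lambda = 100), ncol = 18))

# A tall device plus small margins leaves room for a 6 x 3 grid
out <- tempfile(fileext = ".pdf")
pdf(out, width = 12, height = 16)
old_par <- par(mfrow = c(6, 3), mar = c(2, 2, 1.5, 0.5))

for (nm in names(toy)) {
  # Decompose once per column and subtract the weekly seasonal component
  d <- decompose(ts(toy[[nm]], frequency = 7), type = "additive")
  plot(d$x - d$seasonal, main = nm, xlab = "", ylab = "")
}

par(old_par)
dev.off()
```

Decomposing once per column also avoids calling decompose twice, as the function in the question does.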
As suggested: transform the data into long format with the tidyr package (see the function gather).
I added a time variable since it was missing.
temp <- cityFootfall %>% transform(time = 1:nrow(cityFootfall)) %>% gather(variable, key, -time)
Now plot it with ggplot2 (default settings; you can adjust them as you like):
ggplot(temp, aes(x = time, y = key, group = variable, color = variable)) + geom_point() + geom_line()

How to calculate the sequential date diff in a dataframe and make it as another column for further analysis?

Please read my question carefully before marking it as a duplicate!
I am new to R and I am trying to figure out how to calculate the sequential date difference, in weeks, between each row and the next, and store it in another column so I can make a graph from it.
There are a couple of answers here (Q1, Q2, Q3), but none specifically talks about taking differences within one column sequentially between rows, say from top to bottom.
Below is the example and the expected results:
Date Var1
2/6/2017 493
2/20/2017 558
3/6/2017 595
3/20/2017 636
4/6/2017 697
4/20/2017 566
5/5/2017 234
Expected
Date Var1 week
2/6/2017 493 0
2/20/2017 558 2
3/6/2017 595 4
3/20/2017 636 6
4/6/2017 697 8
4/20/2017 566 10
5/5/2017 234 12
You can use a similar approach to that in your first linked answer by saving the difftime result as a new column in your data frame.
library(dplyr)  # provides first()
# Set up data
df <- read.table(text = "Date Var1
2/6/2017 493
2/20/2017 558
3/6/2017 595
3/20/2017 636
4/6/2017 697
4/20/2017 566
5/5/2017 234", header = T)
df$Date <- as.Date(as.character(df$Date), format = "%m/%d/%Y")
# Create exact week variable
df$week <- difftime(df$Date, first(df$Date), units = "weeks")
# Create rounded week variable
df$week2 <- floor(difftime(df$Date, first(df$Date), units = "weeks"))
df
# Date Var1 week week2
# 2017-02-06 493 0.000000 weeks 0 weeks
# 2017-02-20 558 2.000000 weeks 2 weeks
# 2017-03-06 595 4.000000 weeks 4 weeks
# 2017-03-20 636 6.000000 weeks 6 weeks
# 2017-04-06 697 8.428571 weeks 8 weeks
# 2017-04-20 566 10.428571 weeks 10 weeks
# 2017-05-05 234 12.571429 weeks 12 weeks
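The same calculation can be sketched in base R alone, which also makes the two possible readings of "sequential difference" explicit: weeks elapsed since the first row (what the expected table shows) versus the gap from each row to the previous one. The variable names below are made up for illustration.

```r
# Dates from the question, parsed as month/day/year
dates <- as.Date(c("2/6/2017", "2/20/2017", "3/6/2017", "3/20/2017",
                   "4/6/2017", "4/20/2017", "5/5/2017"),
                 format = "%m/%d/%Y")

# Weeks elapsed since the first row, exact and rounded down
weeks_exact <- as.numeric(dates - dates[1]) / 7
weeks_whole <- floor(weeks_exact)

# Row-to-row gap in days (the first row has no predecessor, so use 0)
gap_days <- c(0, as.numeric(diff(dates)))

data.frame(Date = dates, week = weeks_whole, gap_days = gap_days)
```

Wrapping the differences in as.numeric drops the difftime class, which is usually more convenient when the column is used for plotting.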

Identifying incorrectly transformed data cells

I have a massive Excel spreadsheet full of dates in %m/%d/%Y format. In R, I convert them to Date format using as.Date. The problem is that some of the dates in Excel were manually entered incorrectly, as in the section below, where 214 was entered instead of 2014.
...
235 2014-01-20
236 2014-03-03
237 2014-01-24
238 2014-03-07
239 214-05-23
240 2014-01-31
241 2014-02-19
242 2014-03-27
...
For individual columns, I can use the function which(dataframe$colname_X<1900) which will give me the row number. This is easy because I already know which column it is.
My question is: how can I do the same for the entire data frame, so that I get both the row and column numbers of the faulty cells?
Starting with:
dat <- rd.txt("235 2014-01-20 # #function to use read.table on text
236 2014-03-03
237 2014-01-24
238 2014-03-07
239 214-05-23
240 2014-01-31
241 2014-02-19
242 2014-03-27")
dat <- cbind(dat,dat)
dat[] <- lapply(dat, as.Date, origin="1970-01-01")
> dat
X235 X2014.01.20 X235 X2014.01.20
1 1970-08-25 2014-03-03 1970-08-25 2014-03-03
2 1970-08-26 2014-01-24 1970-08-26 2014-01-24
3 1970-08-27 2014-03-07 1970-08-27 2014-03-07
4 1970-08-28 0214-05-23 1970-08-28 0214-05-23
5 1970-08-29 2014-01-31 1970-08-29 2014-01-31
6 1970-08-30 2014-02-19 1970-08-30 2014-02-19
7 1970-08-31 2014-03-27 1970-08-31 2014-03-27
Now use which with arr.ind=TRUE (you do need to convert to a numeric matrix first):
which( sapply(dat,as.numeric) < (as.numeric(as.Date("1900-01-01") ) ), arr.ind=TRUE)
row col
[1,] 4 2
[2,] 4 4
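Here is a self-contained variant of the same idea that does not rely on the rd.txt helper. The data frame, its column names and the 1900-01-01 cutoff are assumptions chosen for illustration; the technique is still which(..., arr.ind = TRUE) on a numeric matrix.

```r
# Small made-up data frame with one mistyped year in each column
dates <- data.frame(
  start = as.Date(c("2014-01-20", "0214-05-23", "2014-01-31")),
  end   = as.Date(c("2014-03-03", "2014-03-07", "0214-02-19"))
)

# Anything before this date is treated as a data-entry error
cutoff <- as.Date("1900-01-01")

# Dates become day counts; comparing the whole matrix at once keeps positions
as_days <- sapply(dates, as.numeric)
bad <- which(as_days < as.numeric(cutoff), arr.ind = TRUE)
bad

# Column names of the faulty cells, if those are more useful than indices
names(dates)[bad[, "col"]]
```

Each row of bad is one faulty cell, so the same matrix can be used to index back into the data frame when correcting the values.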
One potential solution: identify all errors using apply.
results <- apply(df, 2, function(x) which(x<1900))
This will return a list with each column as an element of the list. As you don't care about those that are empty (i.e. no errors) you could contract the list to only keep those with errors:
results[lapply(results,length)>0]

Resources