I'm working with a dataset that is typically analyzed using start and end dates that differ from a typical calendar year (water years run October 1 to Sept 30). I have no problem plotting the data in ggplot but as soon as I try to use facet_grid to break up the years, the data get misrepresented.
library(lubridate) # year() and year<-
library(ggplot2)
#Create data set
WY18to21 <- as.data.frame(seq(as.Date("2017-10-01"), as.Date("2021-09-30"), "days"))
names(WY18to21) <- c("Date")
WY18to21$Year <- year(WY18to21$Date)
WY18to21$Temp <- c(rep(seq(1,25, by=0.1), times=6), 1:15)
#Create "yearless" date for facet plot
WY18to21$Date_md <- WY18to21$Date
year(WY18to21$Date_md) <- 2000
#Set Water Year (WY)
WY18to21$WY <- "NA"
WY18to21$WY <- ifelse(WY18to21$Date >= as.Date("2017-10-01") &
WY18to21$Date <= as.Date("2018-09-30"), "WY18",WY18to21$WY)
WY18to21$WY <- ifelse(WY18to21$Date >= as.Date("2018-10-01") &
WY18to21$Date <= as.Date("2019-09-30"), "WY19",WY18to21$WY)
WY18to21$WY <- ifelse(WY18to21$Date >= as.Date("2019-10-01") &
WY18to21$Date <= as.Date("2020-09-30"), "WY20",WY18to21$WY)
WY18to21$WY <- ifelse(WY18to21$Date >= as.Date("2020-10-01") &
WY18to21$Date <= as.Date("2021-09-30"), "WY21",WY18to21$WY)
#Plot - regular ggplot
x <- ggplot(WY18to21) +
geom_line(aes(Date, Temp)) +
xlab("") + ylab("[°C]") +
ylim(0,26) +
theme_bw() +
scale_x_date(date_labels = "%b")
x
This looks okay, but I find the aesthetic of facet plots easier to interpret
x <- ggplot(WY18to21) +
geom_line(aes(Date_md, Temp)) +
xlab("") + ylab("[°C]") +
ylim(0,26) +
theme_bw() +
scale_x_date(date_labels = "%b")
x_facet <- x + facet_grid(. ~ WY)
x_facet
It looks like the facet_grid function first splits the data by water year (WY) and then orders the data by the "yearless" date. In practice this means the beginning of the next water year gets moved to the beginning of the current water year.
What I need to do is figure out either:
(a) how do I use facet_grid with dates without removing the year OR
(b) how do I redefine the order dates are plotted so that the year begins on October 1.
Attempts at problem solving: I think the solution has something to do with the mutate function but I have yet to successfully figure that out. I also tried searching under "fiscal year" since often they are also different than calendar years. But no luck. Thanks for the help!
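One way to approach option (b) is sketched below, in base R only. It derives the water year from the month and builds a plotting date whose placeholder year depends on the month, so that October sorts first. The column name Date_wy and the placeholder years 1999 and 2000 are assumptions made for illustration; they are not part of the original code.

```r
# Same date range as in the question
dates <- seq(as.Date("2017-10-01"), as.Date("2021-09-30"), by = "day")
mon <- as.integer(format(dates, "%m"))
yr  <- as.integer(format(dates, "%Y"))

# Water year: October to December belong to the following year's water year
wy <- sprintf("WY%02d", ifelse(mon >= 10, yr + 1L, yr) %% 100L)

# Placeholder years (an assumption, used for plotting only): 1999 for Oct-Dec
# and 2000 for Jan-Sep, so every water year runs 1999-10-01 to 2000-09-30.
# 2000 is a leap year, so 29 February stays a valid date.
dummy_year <- ifelse(mon >= 10, 1999L, 2000L)
date_wy <- as.Date(sprintf("%d-%s", dummy_year, format(dates, "%m-%d")))

water <- data.frame(Date = dates, WY = wy, Date_wy = date_wy)
```

Mapping Date_wy instead of Date_md in aes(), with facet_grid(. ~ WY) and scale_x_date(date_labels = "%b") as in the question, should then start each panel in October.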
I have a measurement of an object with xyz coordinates and a timeline in ms. My CSV looks like this:
TimeInSec X.6 Y.6 Z.6
0.000000 -1.383422 -0.058891 0.023290
0.004167 -1.387636 -0.058947 0.023584
0.008333 -1.391491 -0.058972 0.023989
...
I want to find the row in my dataset where the xyz coordinates stop changing (within a threshold). The key feature I want is a time from row 0 to the stop point of my object.
My Code:
dummy.data <- read.csv(file="D:\\tmp\\dummy.csv", header = TRUE, skip = 6)
dummy.data %>%
gather(key,value, X.6, X.7, X.8, Y.6, Y.7, Y.8, Z.6, Z.7, Z.8) %>%
ggplot(aes(x=Time..Seconds., y=value, colour=key)) +
geom_line()
Many Thanks for your help!
Sample Graph:
Here is the link to the RawData CSV RawData
Here's an updated example that uses exactly the same code as before but now I made some dummy data that shows different offsets and the data settles to a constant value eventually. The point is that successive points will get closer and closer so a Euclidean distance (think of this as the actual distance) between successive points will get smaller. Once below the threshold, we declare the points to have settled.
library(tidyverse)
library(ggplot2)
numberofpoints <- 100
threshold <- 0.01
set.seed(1)
dummy.data <- # make some dummy data with offsets
data.frame(
X.6=runif(numberofpoints), X.7=runif(numberofpoints), X.8=runif(numberofpoints),
Y.6=runif(numberofpoints), Y.7=runif(numberofpoints), Y.8=runif(numberofpoints),
Z.6=runif(numberofpoints), Z.7=runif(numberofpoints), Z.8=runif(numberofpoints)) %>%
mutate(
X.6=3+X.6/row_number(), X.7=1+X.7/row_number(), X.8=2+X.8/row_number(),
Y.6=4+Y.6/row_number(), Y.7=6+Y.7/row_number(), Y.8=9+Y.8/row_number(),
Z.6=5+Z.6/row_number(), Z.7=7+Z.7/row_number(), Z.8=10+Z.8/row_number()
)
distances <- dist(dummy.data) # find distances between all pairs of readings (will be slow for large data)
distances.matrix <- as.matrix(distances)
# distances between adjacent readings
distancechange <- c(NA, sapply(1:(numberofpoints-1), function(r) distances.matrix[r,r+1]))
# the first point below the threshold
changebelowthreshold <- min(which(distancechange < threshold))
# Plot something
dummy.data$Time <- 1:nrow(dummy.data)
thresholdtime <- dummy.data$Time[changebelowthreshold]
plotdata <- dummy.data %>% pivot_longer(cols=c(X.6, X.7, X.8, Y.6, Y.7, Y.8, Z.6, Z.7, Z.8))
gg <- ggplot(plotdata, aes(x=Time, y=value, colour=name)) + geom_line() + geom_vline(xintercept = thresholdtime)
This makes the following plot.
The vertical line shows where the data is below a threshold.
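Since only the distances between successive readings are needed, the full pairwise distance matrix can probably be avoided for large data. The sketch below is a possible alternative rather than the method used above: it takes row-to-row differences and computes the Euclidean length of each step. The toy data frame readings and the names step_length, first_small_step and settled_row are hypothetical.

```r
# Hypothetical toy readings: coordinates that settle over time
readings <- data.frame(
  x = c(0.0, 1.0, 1.5, 1.6, 1.6),
  y = c(0.0, 1.0, 1.5, 1.6, 1.6),
  z = c(0.0, 0.0, 0.0, 0.0, 0.0)
)
threshold <- 0.01

# Row-to-row differences, then the Euclidean length of each step
steps <- diff(as.matrix(readings))
step_length <- sqrt(rowSums(steps^2))

# Index of the first step shorter than the threshold (NA if there is none)
first_small_step <- which(step_length < threshold)[1]

# The row at which that step ends
settled_row <- first_small_step + 1
```

With n readings this needs n - 1 distances instead of the n * (n - 1) / 2 that dist() computes.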
I think from your description you just want to identify a point where the differences between x, y and z values in consecutive rows drop below a certain threshold:
threshold <- 0.001
# diff() entry i compares rows i and i+1; keep only the first index that qualifies
stop_row <- which(abs(diff(dummy.data$X.6)) < threshold &
abs(diff(dummy.data$Y.6)) < threshold &
abs(diff(dummy.data$Z.6)) < threshold )[1]
So now you can do:
result <- dummy.data$TimeInSec[stop_row] - dummy.data$TimeInSec[1]
So I am running a survival analysis on my dataset of Google Play Store downloads.
My analysis using survreg only gives me NAs for the coefficients, though.
"(5 not defined because of singularities)"
If I use a normal lm regression this problem does not occur. That would not work, however, since every observation of the dependent variable is right censored at a different value (the recorded number is also the censoring limit).
My original dataset: https://www.kaggle.com/lava18/google-play-store-apps
So here I will show you my entire code. It might be a bit long so scroll to the end for the survival analysis, but I wanted to give you the ability to fully comprehend.
library(readxl)
Dataset <- read_excel("Thesis/googleplaystore.xlsx")
View(Dataset)
# select 500 apps
set.seed(1998)
dataset <- Dataset[sample(nrow(Dataset), 500), ]
View(dataset)
#Lastupdated --> days_since
end <- matrix( c("2018-08-31"), nrow=500, ncol=1, byrow=FALSE)
end <- format(as.Date(end), "%Y/%m/%d")
View(end)
dataset$`Last Updated` <- as.Date(dataset$`Last Updated`,
format = "%B %d, %Y")
dataset$`Last Updated` <- format(as.Date(dataset$`Last Updated`), "%Y/%m/%d")
View(dataset)
install.packages('lubridate')
library(lubridate)
elapsed.time <- dataset$`Last Updated` %--% end
View(elapsed.time)
dataset$days_since <- as.duration(elapsed.time) / ddays(1)
View(dataset)
# remove + from the number of installs
dataset$Install <- gsub("\\+","", dataset$Installs)
View(dataset)
dataset$Install <- gsub(",","", dataset$Install)
# make installs and price numeric
typeof(dataset$Install)
dataset$Install <- as.numeric(dataset$Install)
View(dataset)
typeof(dataset$Rating)
dataset$Rating <- as.numeric(dataset$Rating)
typeof(dataset$Reviews)
typeof(dataset$Price)
dataset$Price <- gsub("\\$","", dataset$Price)
dataset$Price <- as.numeric(dataset$Price)
typeof(dataset$days_since)
#Tobit Survival analyses
library(help=survival)
library(survival)
dataset$ins_cen <- matrix( c("0"), nrow=500, ncol=1, byrow=FALSE)
typeof(dataset$ins_cen)
dataset$ins_cen <- as.numeric(dataset$ins_cen)
install.packages('tidyverse')
library(tidyverse)
dataset_2 <- dataset %>% filter(!is.na(dataset$Rating))
View(dataset_2)
dataset_2$dum_cen <- ifelse(dataset_2$ins_cen == 0, 0, 1)
dataset_2$dum_fac <- as.factor(dataset_2$dum_cen)
survreg(Surv(Install, ins_cen, type= 'right') ~ Rating + Price + Reviews + days_since,
dist="gaussian", data = dataset_2)
cor(dataset)
#CRCH
install.packages('crch')
library(crch)
View(dataset)
CRCH <- crch(Install ~ Rating + Price + Size + Reviews +days_since + `Current Ver` + Category, data = dataset, dist = 'gaussian', right = dataset_2$Install)
I tried turning the event into a dummy variable and into a factor, but neither option works. The dummy variable changes nothing, while the factor variable gives an error.
Error in survreg(Surv(Install, dum_fac, type = "right") ~ Rating +
Price + : multi-state survival is not supported
Thanks for any help.
Sorry if I am asking stupid questions but I am still learning and can't figure my problem out.
P.S. I also tried to solve my problem using crch(), but this led to a different error, which I can't seem to wrap my head around either.
Error in optim(par = start, fn = loglikfun, gr = gradfun, method =
method, : non-finite value supplied by optim
Edit: I noticed I left character variables in the crch code.
When these are removed from the formula I get a different error.
Error in solve.default(hessfun(par)) : system is computationally
singular: reciprocal condition number = 7.31468e-142
CRCH code:
#CRCH
install.packages('crch')
library(crch)
View(dataset)
CRCH <- crch(Install ~ Rating + Price + Reviews +days_since, data = dataset, dist = 'gaussian', left = -Inf, right = dataset_2$Install)
x = Price + Size + Reviews +days_since + `Current Ver` + Category
I am trying to use the group= option in geom_boxplot, and it works for one grouping function but not for another. The first plot runs; the 2nd and 3rd plots (really the same, called differently) both fail to produce 2-month boxplots for pre-2017 and one-month boxplots for 2017, as the grouper intends. For the grouper function, ggplot gives "Warning message: position_dodge requires non-overlapping x intervals", but the x value is the same across graphs. This is clearly related to my groupdates function, but the groups appear to be constructed properly. Suggestions welcome. With thanks.
library(tidyverse)
library(lubridate)
# I want two month groups before 2017, and one-month groups in 2017
groupdates <- function(date) {
month_candidate <-case_when(
year(date) < 2017 ~ paste0(year(date), "-", (floor(((0:11)/12)*6)*2)+1),
TRUE ~ paste0(year(date), "_", month(date))
)
month_candidate2 <-case_when(
(str_length(month_candidate)==6) ~ paste0(str_sub(month_candidate,1,5), "0", str_sub(month_candidate,6)),
TRUE ~ month_candidate
)
return(month_candidate2)
}
generate_fake_date_time <- function(N, st="2015/01/02", et="2017/02/28") {
st <- as.POSIXct(as.Date(st))
et <- as.POSIXct(as.Date(et))
dt <- as.numeric(difftime(et,st,unit="sec"))
ev <- sort(runif(N, 0, dt))
rt <- st + ev
}
n=5000
set.seed(250)
test <-as.data.frame(generate_fake_date_time(n))
colnames(test) <- "posixctdate"
test$ranvalue <- month(test$posixctdate)+runif(length(test), 0,1)
test$grouped_time <-groupdates(test$posixctdate)
table(test$grouped_time)
ggplot(test)+geom_boxplot(aes(x=posixctdate, y=ranvalue, group=paste0(year(posixctdate), "_", month(posixctdate))))
#ggplot(test)+geom_violin(aes(x=posixctdate, y=ranvalue, group=junk))
ggplot(test)+geom_boxplot(aes(x=posixctdate, y=ranvalue, group=grouped_time))
ggplot(test)+geom_boxplot(aes(x=posixctdate, y=ranvalue, group=groupdates(posixctdate)))
sessionInfo()
If I correctly understood your problem, you should think about modifying your groupdates function.
I only modified the 3rd line using :
ceiling instead of floor
month(date) instead of 0:11
Resulting in :
groupdates <- function(date) {
month_candidate <-case_when(
year(date) < 2017 ~ paste0(year(date), "-", (ceiling(((month(date))/12)*6)*2)+1),
TRUE ~ paste0(year(date), "_", month(date))
)
month_candidate2 <-case_when(
(str_length(month_candidate)==6) ~ paste0(str_sub(month_candidate,1,5), "0", str_sub(month_candidate,6)),
TRUE ~ month_candidate
)
return(month_candidate2)
}
I also modified the computation of ranvalue to have a better distribution, I bet you wanted to use nrow instead of length :
test$ranvalue <- month(test$posixctdate) + runif(nrow(test), 0, 1)
test$grouped_time <-groupdates(test$posixctdate)
table(test$grouped_time)
And the output (no changes) :
ggplot(test)+geom_boxplot(aes(x=posixctdate, y=ranvalue, group=grouped_time))
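The expression in the modified groupdates groups the months correctly but labels January as 3 and December as 13. If labels that match the first month of each two-month bin are preferred, one possible alternative is the integer arithmetic sketched below. The helper name bimonth_start is hypothetical and is not part of the answer's code.

```r
# Map a month number (1-12) to the first month of its two-month bin:
# 1,2 -> 1; 3,4 -> 3; ...; 11,12 -> 11. Integer division avoids the
# division by 12 and therefore any floating-point rounding.
bimonth_start <- function(m) {
  m <- as.integer(m)
  2L * ((m + 1L) %/% 2L) - 1L
}

bins <- bimonth_start(1:12)

# Zero-padded group labels for a given year, e.g. for 2015
labels_2015 <- sprintf("%d-%02d", 2015L, bins)
```

Inside groupdates, sprintf("%d-%02d", year(date), bimonth_start(month(date))) could then replace the pre-2017 branch, and the zero-padding step would only be needed for the 2017 labels.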
Or even the words in the plot itself? Any hints on that are welcome.
dat <- selectByDate(mydata, year = 2003)
dat <- data.frame(date = mydata$date, obs = mydata$nox, mod = mydata$nox)
dat <- transform(dat, month = as.numeric(format(date, "%m")))
mod1 <- transform(dat, mod = mod + 10 * month + 10 * month * rnorm(nrow(dat)),model = "model 1")
mod1 <- transform(mod1, mod = c(mod[5:length(mod)], mod[(length(mod) - 3) :
length(mod)]))
mod2 <- transform(dat, mod = mod + 7 * month + 7 * month * rnorm(nrow(dat)),
model = "model 2")
mod.dat <- rbind(mod1, mod2)
Much of this appears to have been hard coded, so I don't think modifying this plot will be easy in general. In the specific case of the legend text, you can modify some arguments in the plot object after creating it:
out <- TaylorDiagram(mod.dat, obs = "obs", mod = "mod", group = "model")
out$plot$legend$right$args$key$text$cex <- 1.5
out$plot$legend$right$args$key$cex.title <- 1.5
I don't see anything similar that only applies to the text in the plot itself. To modify that you'd likely have to dig further into the code itself and modify it to get the specific results you want.
Indeed, digging further, most of the plot's details are handled in the custom panel functions panel.taylor.setup and panel.taylor, in which almost all of the specific sizes are hard coded.
I'd like to add a curve to a plot I'm making with ggplot, but I only want the curve to appear for a particular domain.
I've tried various approaches using stat_function:
data <- data.frame(Date = ..., cases = ...)
end_date <- ... ## calculated from a date (e.g., Sys.Date()) minus an offset
start_date <- ... ## end_date - some offset
p1 <- ggplot(data) + aes(x=Date, y=cases) + ... ## data has Date, cases columns
p1 + stat_function(...something..., fun=function(t) ...)
where for something I've tried to put a new, subsetted chunk of data:
data = data[(start_date <= data$Date) & (data$Date <= end_date),] ## no change
and a new aes
aes = aes(xmin = start_date, xmax = end_date)
## error - thinks start_date / end_date don't exist,
## though they are declared earlier
Any suggestions? I've also fiddled around with annotate("path", ...) but nothing concrete there. I feel like this should be something easy, I just don't have my head around the "ggplot way" to make it happen.
It may also be relevant that I'm making these plots in a shiny application, though aside from funny crap w/ data.table, I haven't noticed that affecting anything.
The following seems to work, though it still feels very hacky to me:
data$fit <- ... # evaluate function on Date
relrows <- (start_date <= data$Date) & (data$Date <= end_date)
p1 <- p1 + annotate("line", y=data$fit[relrows], x=data$Date[relrows])
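A variation on the same idea, sketched under assumptions: evaluate the function on every row and set the values outside the domain to NA, so the fitted column can be drawn as an ordinary layer. The example data, the dates and curve_fun are all hypothetical stand-ins for the elided parts of the question.

```r
# Hypothetical data standing in for the Date / cases columns
data <- data.frame(
  Date  = seq(as.Date("2020-03-01"), by = "day", length.out = 30),
  cases = seq_len(30)^2
)
end_date   <- as.Date("2020-03-25")
start_date <- end_date - 10

# Hypothetical curve; any function of the date would do here
curve_fun <- function(d) 5 * as.numeric(d - as.Date("2020-03-01"))

# Evaluate everywhere, then blank the values outside the domain
in_domain <- data$Date >= start_date & data$Date <= end_date
data$fit <- ifelse(in_domain, curve_fun(data$Date), NA_real_)
```

Adding geom_line(aes(y = fit), na.rm = TRUE) to p1 should then draw the curve only between start_date and end_date, since rows with a missing fit are dropped.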
Try adding another label as a new column in your dataframe.
data$newlabel <- NA
data$newlabel[(start_date <= data$Date) & (data$Date <= end_date)] <- "a"
then add groups to your ggplot
p1 <- ggplot(data) +
aes(x=Date, y=cases, group=newlabel, colour=newlabel) +
geom_point() +
stat_smooth(method = "lm", formula = y ~ poly(x,2), size=1)