How to increase the font size of a legend in R?

Or even the words in the plot itself? Any hints on that are welcome.
library(openair)  # provides mydata, selectByDate() and TaylorDiagram()

dat <- selectByDate(mydata, year = 2003)
dat <- data.frame(date = mydata$date, obs = mydata$nox, mod = mydata$nox)
dat <- transform(dat, month = as.numeric(format(date, "%m")))
mod1 <- transform(dat, mod = mod + 10 * month + 10 * month * rnorm(nrow(dat)),
                  model = "model 1")
mod1 <- transform(mod1, mod = c(mod[5:length(mod)],
                                mod[(length(mod) - 3):length(mod)]))
mod2 <- transform(dat, mod = mod + 7 * month + 7 * month * rnorm(nrow(dat)),
                  model = "model 2")
mod.dat <- rbind(mod1, mod2)

Much of this appears to have been hard coded, so I don't think modifying this plot will be easy in general. In the specific case of the legend text, you can modify some arguments in the plot object after creating it:
out <- TaylorDiagram(mod.dat, obs = "obs", mod = "mod", group = "model")
out$plot$legend$right$args$key$text$cex <- 1.5
out$plot$legend$right$args$key$cex.title <- 1.5
I don't see anything similar that applies only to the text in the plot itself. To modify that, you'd likely have to dig into the code and change it to get the specific results you want.
Indeed, digging further, many of the details of the plot are handled in the custom panel functions panel.taylor.setup and panel.taylor, in which almost all of the specific sizes are hard-coded.
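That said, one workaround may be worth trying before editing the source: in grid graphics, cex settings act as multipliers on a base fontsize, so raising the base fontsize through par.settings can scale even text whose cex is hard-coded in the panel functions. A hedged sketch, untested against openair's panel.taylor:
# untested sketch: raise the base fontsize that the hard-coded cex values multiply
library(lattice)
bigger <- update(out$plot, par.settings = list(fontsize = list(text = 16)))
print(bigger)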

Related

ggplot legend values rescale

I would like to rescale the values on the legend of a plot coming from conditional_effects. By doing something like this
plot(conditional_effects(brm_c_5, effects = "t:w_c_ratio", cond = conditions5), rug = TRUE, points = TRUE)
I get a plot whose legend shows w_c_ratio on its original scale (screenshot omitted).
For the time being I'm doing
p_col_1 <- ggplot_build(p_col_1)
and then changing the ranges in p_col_1$plot$scales$scales[[3]]$range$range and p_col_1$plot$scales$scales[[4]]$range$range, but I don't trust this solution.
EXAMPLE:
As an example, please see this code. The default values of kidney$age run from 10 to 69, but let's say I want to rescale them to run from -1 to 1. I could do that via ggplot_build as above, but I'm looking for a smarter, more elegant solution.
library(brms)
library(ggplot2)   # for ggplot_build() and ggplot_gtable()
library(magrittr)  # for %>%

fit1 <- brm(time | cens(censored) ~ age + sex + disease,
            data = kidney, family = weibull, init = "0")
fit1
p_tr <- plot(conditional_effects(fit1, effects = "disease:age"),
             rug = TRUE, points = TRUE)[[1]]
p_tr <- ggplot_build(p_tr)
p_tr$plot$scales$scales[[3]]$range$range <- c(1, 0, -1) %>% as.character()
p_tr$plot$scales$scales[[4]]$range$range <- c(1, 0, -1) %>% as.character()
plot(p_tr %>% ggplot_gtable())
How could I rescale the values of w_c_ratio to run from -0.9 to +0.9 instead of the original scale (which goes from 2 to 10)?
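A possibly less fragile route than editing ggplot_build() internals is to relabel the legend breaks with a scale, since plot(conditional_effects(...))[[1]] returns an ordinary ggplot. A hedged sketch using the kidney example, assuming age appears as a discrete colour/fill legend (as brms typically produces for a numeric moderator); the relabel() helper and the use of scales::rescale() are additions, not from the original thread:
# hedged sketch: relabel the legend instead of editing built-plot internals
library(scales)
relabel <- function(l) {
  # map the original age labels onto [-1, 1]
  round(rescale(as.numeric(l), to = c(-1, 1), from = range(kidney$age)), 2)
}
p_tr <- plot(conditional_effects(fit1, effects = "disease:age"),
             rug = TRUE, points = TRUE)[[1]]
p_tr +
  scale_colour_discrete(labels = relabel) +
  scale_fill_discrete(labels = relabel)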

loop through gtsummary table to pick out only significant variables

I have a question. I am relatively new to R and am transitioning some code from another app. In that code, I was able to loop through a table and pick out only the significant variables from a logistic regression, based on the p-value and the size of the odds ratio. I could then say something like "x had a significant link with y" when p was less than or equal to 0.05 and the odds ratio was above 1.00, and conversely "x had a significant negative link with y" when p was below 0.05 and the odds ratio was below 1.00. From what I understand of the gtsummary literature, these statements are what inline_text() is for. As I am trying to get my bearings with R, how would I accomplish this with gtsummary tables? My reproducible code does not work, but it is below:
# install.packages("gtsummary")
library(gtsummary)
library(tidyverse)
#simulated data
gender <- sample(c(0,1), size = 1000, replace = TRUE)
age <- round(runif(1000, 18, 80))
xb <- -9 + 3.5*gender + 0.2*age
p <- 1/(1 + exp(-xb))
y <- rbinom(n = 1000, size = 1, prob = p)
mod <- glm(y ~ gender + age, family = "binomial")
summary(mod)
#create the gtsummary table
tab1 = mod %>%
  tbl_regression(exponentiate = TRUE) %>%
  as_gt() %>%
  gt::tab_source_note(gt::md("*This data is simulated*"))
#attempt of going through the gtsummary table
for (i in 1:nrow(tab1[1:3, ])) { # does one row at a time
  pv = tab1[["_data"]]$p.value
  num = tab1[i, "pv"]
  name = tab1[i, "variable"]
  if (pv <= 0.05) {
    cat("The link between", name, "and is significant. ")
  }
}
I ask about the gtsummary regression table because I will have to do the same thing with tbl_summary as well; I thought I would begin with the regression version. The idea is to get the gorgeous inline_text output via an if/else: go down the p-value column, then pull the variable name and the amazing inline_text information into the sentence. I have looked through the questions others have asked but haven't found anything that gets to the heart of this. If I have missed it, please point me in the right direction.
Every gtsummary table contains a data frame called x$table_body. I think it's easier to extract the information you need from there. Example below! (You could also wrap the last line in an inline_text() if that works better for you.)
# install.packages("gtsummary")
library(gtsummary)
#> #BlackLivesMatter
library(tidyverse)
#simulated data
gender <- sample(c(0,1), size = 1000, replace = TRUE)
age <- round(runif(1000, 18, 80))
xb <- -9 + 3.5*gender + 0.2*age
p <- 1/(1 + exp(-xb))
y <- rbinom(n = 1000, size = 1, prob = p)
mod <- glm(y ~ gender + age, family = "binomial")
#create the gtsummary table
tab1 = mod %>% tbl_regression(exponentiate = TRUE)
# extract the variable names and the pvalues
tab1$table_body %>%
  select(variable, p.value) %>%
  filter(p.value <= 0.05) %>% # only keep the sig pvalues
  deframe() %>%
  imap(~ str_glue("The link between 'y' and {.y} is significant ({style_pvalue(.x, prepend_p = TRUE)})."))
#> $gender
#> The link between 'y' and gender is significant (p<0.001).
#>
#> $age
#> The link between 'y' and age is significant (p<0.001).
Created on 2022-11-07 with reprex v2.0.2
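A hedged sketch of the inline_text() route mentioned above, assuming inline_text() accepts the variable name as a string (it resolves variables via tidyselect):
# pull the significant variables, then let inline_text() format each sentence
sig_vars <- tab1$table_body %>%
  filter(p.value <= 0.05) %>%
  pull(variable)
map_chr(sig_vars, ~ str_glue(
  "The link between 'y' and {.x} is significant ",
  "({inline_text(tab1, variable = .x)})."
))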

Fitting a sine wave model on POSIXt data and plotting using ggplot2

Long-time reader, first-time asker here :)
I have some data collected at specific times and dates, and there is reason to hypothesize the data roughly follows a 24-hour cycle. I would like to fit a sine wave model on my data as a function of time, so that it is possible to test if future data points fall on the predicted pattern.
I have read a few previous responses on fitting sine models, but they don't solve my problem because in my case I'm hoping to keep the x-axis data in POSIXct date-time format. That's how the data is collected, and using this format makes for an easily interpreted plot.
Here's some reproducible data, structured identically to my real data:
time <- c("2022-01-01 09:20:00", "2022-01-02 11:10:00",
          "2022-01-02 18:37:00", "2022-01-03 14:01:00",
          "2022-01-05 06:50:00", "2022-01-06 17:03:00")
time <- as.POSIXct(time)
value <- c(3, 6, 2, 8, 4, 1)
These are plotted fine in base R:
plot(time, value)
However, I run into trouble when I try to construct a sine regression model to fit the time series. I'm also struggling to fully understand the parameters required by the nls function. Based on the previous examples, I have tried this approach (with comments on how I understand it to work):
res <- nls(value ~ A * sin(omega * time + phi) + C,        # the basic sine-function form
           data = data.frame(time, value),                 # the data used
           start = list(A = 1, omega = 1, phi = 1, C = 1)) # starting values for nls?
Here I get the error message "Error in Ops.POSIXt(omega, time) : '*' not defined for "POSIXt" objects", which I interpret as meaning that the date format I want to use is not acceptable for this type of approach. I know this because if I simply replace the time variable with a dummy vector of integers, the model works fine and I can plot it as follows:
time2 <- c(1, 2, 3, 4, 5, 6)
res <- nls(value ~ A * sin(omega * time2 + phi) + C,
           data = data.frame(time2, value),
           start = list(A = 1, omega = 1, phi = 1, C = 1))
coefs <- coef(res)
fit <- function(x, a, b, c, d) { a * sin(b * x + c) + d }
plot(time2, value)
curve(fit(x, a = coefs["A"], b = coefs["omega"],
          c = coefs["phi"], d = coefs["C"]),
      add = TRUE, lwd = 2, col = "red")
I know I'm on the right track, but my main question is: how can I do the above while keeping the time variable in POSIXct format?
As mentioned, my main order of business is to plot the data using ggplot2, but I can't begin to try that before solving the initial problem. Any pointers on how to get started with that are greatly appreciated! :)
I would probably just generate a numeric number of days from an arbitrary origin time and use that. You can then modify your fit function so that it converts date-times to predicted values. You can then easily make a data frame of predictions from your model and plot that.
library(ggplot2)

df <- data.frame(time = time, value = value)
origin <- as.POSIXct("2022-01-01 00:00:00")
df$days <- as.numeric(difftime(time, origin, units = "days"))

res <- nls(value ~ A * sin(omega * days + phi) + C,
           data = df,
           start = list(A = 1, omega = 1, phi = 1, C = 1))

fit <- function(res, newdata) {
  # convert date-times to numeric days from the origin, then predict
  x <- as.numeric(difftime(newdata$time, origin, units = "days"))
  C <- as.list(coef(res))
  C$A * sin(C$omega * x + C$phi) + C$C
}

# a fine grid of times over the observed range, for a smooth fitted curve
new_times <- seq(0, 6, by = 0.01)
new_df <- data.frame(time = origin + as.difftime(new_times, units = "days"))
new_df$value <- fit(res, new_df)

ggplot(df, aes(time, value)) +
  geom_point() +
  geom_line(data = new_df, colour = "gray") +
  theme_bw()
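To test whether a future observation falls on the predicted pattern (the stated goal), the same fit() helper can be reused on any new POSIXct time; the date below is a hypothetical example:
# usage sketch: predicted value at a future time point
future <- data.frame(time = as.POSIXct("2022-01-07 12:00:00"))
fit(res, future)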

Is there a more efficient way to test a range of exponents in an lm() model than the code below?

The basic gist is that I have a set of housing data for which I need to create a model that minimizes the difference between predicted and actual house prices. So I created this bit of code to test a range of different exponents and find the ones that minimize that difference. I'm using the median instead of the mean, as the data isn't exactly normal.
Since I only have experience with lm(), I'm using that to create the coefficients and C values. But since the model likes exponents, I have to test various exponents as well. The code does this for each of the variables in turn, then goes back to the first and re-evaluates it given the other exponents. It starts with all exponents equal to 1, i.e. the basic linear model. I know this is probably horribly inefficient and somewhat wasteful, but I'm in my first R class, so sorry about the mess and/or convoluted coding logic.
Is there any way to do the same thing more efficiently? Also, I can't really decrease the number of variables, as the model likes having more variables and produces a greater margin of error when they aren't present.
library(modelr)  # for add_predictions()
library(dplyr)

w <- seq(1, 10000, 1)
r <- seq(1, 10000, 1)
q <- seq(1, 10000, 1)  # 75th-percentile results
z <- seq(1, 10000, 1)
s <- seq(1, 10000, 1)
coef_1 <- rep(6000, 8)
v <- rep(6000, 8)
for (l_1 in 1:10) {
  for (t_1 in 1:8) {
    for (i in 1:10000) {
      coef_1[t_1] <- i  # set the exponent index for the term being evaluated
      mod5 <- lm(log(SALE_PRC) ~ I(TOT_LVG_AREA^((coef_1[1] - 5000) / 1000)) +
                   I(LND_SQFOOT^((coef_1[2] - 5000) / 1000)) +
                   I(RAIL_DIST^((coef_1[3] - 5000) / 1000)) +
                   I(OCEAN_DIST^((coef_1[4] - 5000) / 1000)) +
                   I(CNTR_DIST^((coef_1[5] - 5000) / 1000)) +
                   I(HWY_DIST^((coef_1[6] - 5000) / 1000)) +
                   I(structure_quality^((coef_1[7] - 5000) / 1000)) +
                   SUBCNTR_DI + SPEC_FEAT_VAL +
                   exp(((coef_1[8] - 5000) / 1000) * SPECIAL_RATIO) + age,
                 data = kaggle_transform_final)
      kaggle_new <- kaggle_transform_final %>%
        add_predictions(model = mod5, var = "prediction") %>%
        mutate(new_predict = exp(prediction),
               new_difference = abs(new_predict - SALE_PRC) / SALE_PRC,
               average_percent_difference = median(new_difference),
               mean_percent_difference = mean(new_difference),
               quart_75 = quantile(new_difference, .75))
      w[i] <- kaggle_new$average_percent_difference[1]
      r[i] <- kaggle_new$mean_percent_difference[1]
      q[i] <- kaggle_new$quart_75[1]
      z[i] <- i
      s[i] <- (i - 5000) / 1000
      if (i %% 100 == 0) show(i)
    }
    u <- data.frame(median_diff = w, mean_diff = r, quart_75 = q,
                    actual = s, number = z) %>%
      arrange(median_diff)
    coef_1[t_1] <- u$number[1]
    v[t_1] <- u$actual[1]
    show(coef_1)
  }
}
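For comparison, a hedged sketch of a more direct route: let optim() search the exponent space numerically instead of looping over a 10,000-point grid per term. It assumes the same kaggle_transform_final columns used above and is untested:
# hedged sketch: minimise the median absolute percent error over the exponents
objective <- function(p) {
  mod <- lm(log(SALE_PRC) ~ I(TOT_LVG_AREA^p[1]) + I(LND_SQFOOT^p[2]) +
              I(RAIL_DIST^p[3]) + I(OCEAN_DIST^p[4]) + I(CNTR_DIST^p[5]) +
              I(HWY_DIST^p[6]) + I(structure_quality^p[7]) +
              SUBCNTR_DI + SPEC_FEAT_VAL + exp(p[8] * SPECIAL_RATIO) + age,
            data = kaggle_transform_final)
  pred <- exp(predict(mod, kaggle_transform_final))
  median(abs(pred - kaggle_transform_final$SALE_PRC) / kaggle_transform_final$SALE_PRC)
}
# start every exponent at 1, i.e. the plain linear model
opt <- optim(rep(1, 8), objective, method = "Nelder-Mead",
             control = list(maxit = 500))
opt$par  # fitted exponents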

Why do I get "not defined because of singularities"?

So I am running a survival analysis on my dataset of Google Play Store downloads.
My analysis using survreg only gives me NAs for the coefficients, with the message "(5 not defined because of singularities)".
If I use a normal lm regression this problem does not occur, but that wouldn't work anyway, since all observations of the dependent variable are right censored, each at a different value (the numeric value is also the limit).
My original dataset: https://www.kaggle.com/lava18/google-play-store-apps
Here is my entire code. It might be a bit long, so scroll to the end for the survival analysis, but I wanted to give you the full picture.
library(readxl)
Dataset <- read_excel("Thesis/googleplaystore.xlsx")
View(Dataset)
# select 500 apps
set.seed(1998)
dataset <- Dataset[sample(nrow(Dataset), 500), ]
View(dataset)
# Last Updated --> days_since
end <- matrix(c("2018-08-31"), nrow = 500, ncol = 1, byrow = FALSE)
end <- format(as.Date(end), "%Y/%m/%d")
View(end)
dataset$`Last Updated` <- as.Date(dataset$`Last Updated`, format = "%B %d, %Y")
dataset$`Last Updated` <- format(as.Date(dataset$`Last Updated`), "%Y/%m/%d")
View(dataset)
install.packages('lubridate')
library(lubridate)
elapsed.time <- dataset$`Last Updated` %--% end
View(elapsed.time)
dataset$days_since <- as.duration(elapsed.time) / ddays(1)
View(dataset)
# remove "+" from the number of installs
dataset$Install <- gsub("\\+", "", dataset$Installs)
View(dataset)
dataset$Install <- gsub(",", "", dataset$Install)
# make installs and price numeric
typeof(dataset$Install)
dataset$Install <- as.numeric(dataset$Install)
View(dataset)
typeof(dataset$Rating)
dataset$Rating <- as.numeric(dataset$Rating)
typeof(dataset$Reviews)
typeof(dataset$Price)
dataset$Price <- gsub("\\$", "", dataset$Price)
dataset$Price <- as.numeric(dataset$Price)
typeof(dataset$days_since)
# Tobit survival analysis
library(help = survival)
library(survival)
dataset$ins_cen <- matrix(c("0"), nrow = 500, ncol = 1, byrow = FALSE)
typeof(dataset$ins_cen)
dataset$ins_cen <- as.numeric(dataset$ins_cen)
install.packages('tidyverse')
library(tidyverse)
dataset_2 <- dataset %>% filter(!is.na(dataset$Rating))
View(dataset_2)
dataset_2$dum_cen <- ifelse(dataset_2$ins_cen == 0, 0, 1)
dataset_2$dum_fac <- as.factor(dataset_2$dum_cen)
survreg(Surv(Install, ins_cen, type = 'right') ~ Rating + Price + Reviews + days_since,
        dist = "gaussian", data = dataset_2)
cor(dataset)
# CRCH
install.packages('crch')
library(crch)
View(dataset)
CRCH <- crch(Install ~ Rating + Price + Size + Reviews + days_since +
               `Current Ver` + Category,
             data = dataset, dist = 'gaussian', right = dataset_2$Install)
I tried turning the event into a dummy variable and a factor, but neither option works: the dummy variable changes nothing, while the factor variable gives an error.
Error in survreg(Surv(Install, dum_fac, type = "right") ~ Rating + Price + : multi-state survival is not supported
Thanks for any help, and sorry if I am asking stupid questions; I am still learning and can't figure this out.
P.S. I also tried to solve my problem using crch(), but this led to a different error that I can't wrap my head around either.
Error in optim(par = start, fn = loglikfun, gr = gradfun, method = method, : non-finite value supplied by optim
Edit: I noticed I had left character variables in the crch code. When these are removed from the formula, I get a different error:
Error in solve.default(hessfun(par)) : system is computationally singular: reciprocal condition number = 7.31468e-142
CRCH code:
# CRCH
install.packages('crch')
library(crch)
View(dataset)
CRCH <- crch(Install ~ Rating + Price + Reviews + days_since,
             data = dataset, dist = 'gaussian',
             left = -Inf, right = dataset_2$Install)
