How do I make graphs like this in R?
Let's say I have a dataset like this:
data <- tibble(date=sample(seq(as.Date("2006-01-01"),
as.Date("2019-01-01"), by="day"),
10000, replace = T),
treatment=sample(c(0,1),10000, replace= T),
after=ifelse(date>as.Date("2015-03-01"), 1, 0),
score=rnorm(10000)+ifelse(treatment*after==1, 0.2, 0)
)
and I am doing a difference-in-differences analysis:
did <- lm(score~treatment+after+treatment*after, data=data)
summary(did)
How can I make a plot with placebo tests?
Just use the plot_model function from sjPlot.
data <- tibble(date=sample(seq(as.Date("2006-01-01"),
as.Date("2019-01-01"), by="day"),
10000, replace = T),
treatment=sample(c(0,1),10000, replace= T),
after=ifelse(date>as.Date("2015-03-01"), 1, 0),
score=rnorm(10000)+ifelse(treatment*after==1, 0.2, 0)
)
did <- lm(score~treatment+after+treatment*after, data=data)
summary(did)
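library(ggplot2)  # provides ylim()
sjPlot::plot_model(did, vline.color = 'black', show.values = TRUE) + ylim(-.25, .5)
vline.color sets the color of the vertical "no effect" reference line, which sits at 0 for a linear model like this one;
show.values controls whether the estimates are printed next to the points.
You can check the details of the arguments of plot_model here.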
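The coefficient plot above shows a single estimate. For placebo tests, one common approach (an assumption here, since the question does not define them) is to re-estimate the interaction with fake cutoff dates placed in the pre-treatment period, where the true effect should be zero, and plot those estimates next to the real one. A minimal base-R sketch; the helper did_estimate and the chosen placebo dates are illustrative only:

```r
set.seed(1)
n <- 10000
dat <- data.frame(
  date = sample(seq(as.Date("2006-01-01"), as.Date("2019-01-01"), by = "day"),
                n, replace = TRUE),
  treatment = sample(c(0, 1), n, replace = TRUE)
)
true_cutoff <- as.Date("2015-03-01")
dat$score <- rnorm(n) + ifelse(dat$treatment == 1 & dat$date > true_cutoff, 0.2, 0)

# Hypothetical helper: DiD interaction estimate and 95% CI for a given cutoff
did_estimate <- function(cutoff, d) {
  d$after <- as.numeric(d$date > cutoff)
  fit <- lm(score ~ treatment * after, data = d)
  ci <- confint(fit)["treatment:after", ]
  data.frame(cutoff = cutoff,
             estimate = unname(coef(fit)["treatment:after"]),
             lower = ci[[1]], upper = ci[[2]])
}

# Placebo cutoffs use pre-treatment observations only, so no real effect exists
pre <- dat[dat$date <= true_cutoff, ]
placebo_cutoffs <- as.Date(paste0(2008:2013, "-03-01"))
placebo <- do.call(rbind, lapply(seq_along(placebo_cutoffs),
                                 function(i) did_estimate(placebo_cutoffs[i], pre)))
res <- rbind(placebo, did_estimate(true_cutoff, dat))
res$type <- c(rep("placebo", nrow(placebo)), "actual")

# Estimates with confidence intervals; placebo intervals should mostly cover 0
plot(res$cutoff, res$estimate, pch = 19,
     col = ifelse(res$type == "actual", "red", "black"),
     ylim = range(res$lower, res$upper),
     xlab = "Cutoff date", ylab = "DiD estimate (treatment:after)")
segments(as.numeric(res$cutoff), res$lower, as.numeric(res$cutoff), res$upper)
abline(h = 0, lty = 2)
```

The same res data frame could be passed to ggplot2 (geom_point plus geom_errorbar) if a ggplot-style figure is preferred.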
I am using the R programming language. On some bigger data, I tried the following code (make a decision tree):
#load library
library(rpart)
#generate data
a = rnorm(100, 7000000, 10)
b = rnorm(100, 5000000, 5)
c = rnorm(100, 400000, 10)
group <- sample( LETTERS[1:2], 100, replace=TRUE, prob=c(0.5,0.5) )
group_1 <- sample( LETTERS[1:4], 100, replace=TRUE, prob=c(0.25, 0.25, 0.25, 0.25) )
d = data.frame(a,b,c, group, group_1)
d$group = as.factor(d$group)
d$group_1 = as.factor(d$group_1)
#fit model
tree <- rpart(group ~ ., d)
#visualize results
plot(tree)
text(tree, use.n=TRUE, minlength = 0, xpd=TRUE, cex=.8)
In the visual output, the numbers are displayed in scientific notation (e.g. 4.21e+06). Is there a way to disable this?
I consulted this previous answer on Stack Overflow: How to disable scientific notation?
I then tried the following command : options(scipen=999)
But this did not seem to fix the problem.
Can someone please tell me what I am doing wrong?
Thanks
I think the labels.rpart function has scientific notation hard-coded in: it uses a private function called formatg to do the formatting using sprintf() with a %g format, and that function ignores options(scipen). You can override this by replacing formatg with a better function. Here's a dangerous way to do that:
oldformatg <- rpart:::formatg
assignInNamespace("formatg", format, "rpart")
which replaces formatg with the standard format function. This will definitely have dangerous side effects, so afterwards you should change it back using
assignInNamespace("formatg", oldformatg, "rpart")
A better solution would be to rescale your data. rpart switches to scientific notation only for big numbers, so you could divide the bad numbers by something like 1000 or 1000000, and describe them as being in different units. For your example, this works for me:
library(rpart)
#generate data
set.seed(123)
a = rnorm(100, 7000000, 10)/1000
b = rnorm(100, 5000000, 5)/1000
c = rnorm(100, 400000, 10)/1000
group <- sample( LETTERS[1:2], 100, replace=TRUE, prob=c(0.5,0.5) )
group_1 <- sample( LETTERS[1:4], 100, replace=TRUE, prob=c(0.25, 0.25, 0.25, 0.25) )
d = data.frame(a,b,c, group, group_1)
d$group = as.factor(d$group)
d$group_1 = as.factor(d$group_1)
#fit model
tree <- rpart(group ~ ., d)
#visualize results
plot(tree)
text(tree, use.n=TRUE, minlength = 0, xpd=TRUE, cex=.8)
Created on 2021-01-27 by the reprex package (v0.3.0)
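The difference between %g formatting and format() can be checked directly. This is a small sketch, assuming (as described above) that the labels are built with sprintf() and a %g-style format; the four significant digits used here are an assumption about the plotting default:

```r
options(scipen = 999)       # affects format() and print(), not sprintf()

x <- 4210000
sprintf("%.4g", x)          # "4.21e+06": %g ignores scipen
format(x)                   # "4210000"
sprintf("%.4g", x / 1000)   # "4210": the rescaled value stays in fixed notation
```

With %g, scientific notation is used once the exponent reaches the number of significant digits, which is why dividing by 1000 is enough for values of this size.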
I have some repeated measures, ordinal response data:
dat <- data.frame(
id = factor(sample(letters[1:5], 50, replace = T)),
response = factor(sample(1:7, 50, replace = T), ordered = T),
x1 = runif(n = 50, min = 1, max = 10),
x2 = runif(n = 50, min = 100, max = 1000)
)
I have built the following model:
library(ordinal)
model <- clmm(response ~ x1 + x2 + (1|id), data = dat)
I have some new data:
new_dat <- data.frame(
id = factor(sample(letters[1:5], 5, replace = T)),
x1 = runif(n = 5, min = 1, max = 10),
x2 = runif(n = 5, min = 100, max = 1000)
)
I want to be able to use the model to predict the probability of each level of dat$response occurring for new_dat, whilst still also accounting for id.
Unfortunately predict() does not work for clmm objects. predict() does work for clmm2 objects but it ignores any random effects included.
What I want to achieve is something similar to what has been done in Figure 3 of the following using this code:
library(ordinal)
fm2 <- clmm2(rating ~ temp + contact, random=judge, data=wine, Hess=TRUE, nAGQ=10)
pred <- function(eta, theta, cat = 1:(length(theta)+1), inv.link = plogis){
Theta <- c(-1e3, theta, 1e3)
sapply(cat, function(j)
inv.link(Theta[j+1] - eta) - inv.link(Theta[j] - eta))
}
mat <- expand.grid(judge = qnorm(0.95) * c(-1, 0, 1) * fm2$stDev,
contact = c(0, fm2$beta[2]),
temp = c(0, fm2$beta[1]))
pred.mat <- pred(eta=rowSums(mat), theta=fm2$Theta)
lab <- paste("contact=", rep(levels(wine$contact), 2), ", ", "temp=", rep(levels(wine$temp), each=2), sep="")
par(mfrow=c(2, 2))
for(k in c(1, 4, 7, 10)) {
plot(1:5, pred.mat[k,], lty=2, type = "l", ylim=c(0,1),
xlab="Bitterness rating scale", axes=FALSE,
ylab="Probability", main=lab[ceiling(k/3)], las=1)
axis(1); axis(2)
lines(1:5, pred.mat[k+1, ], lty=1)
lines(1:5, pred.mat[k+2, ], lty=3)
legend("topright",
c("avg. judge", "5th %-tile judge", "95th %-tile judge"),
lty=1:3, bty="n")
}
Except, my model contains multiple continuous covariates (as opposed to binary factors).
How can I use the model data to predict the probability of each level of dat$response occurring for new_dat, whilst still also accounting for id?
Many thanks.
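One possible way to account for id is to do by hand what the pred() function above does: add each subject's estimated random intercept to the fixed-effect linear predictor and difference the cumulative probabilities. The sketch below uses made-up numbers for the coefficients, thresholds, and random intercepts so that it runs on its own; with a fitted clmm these would presumably be taken from the model (for example model$beta, model$alpha, and ranef(model)$id, which is an assumption to verify against the ordinal documentation). predict_probs is a hypothetical helper name.

```r
# P(Y = j) = F(theta_j - eta) - F(theta_{j-1} - eta), with eta = X %*% beta + u_id
predict_probs <- function(X, beta, theta, u = 0, inv.link = plogis) {
  eta <- as.vector(X %*% beta) + u
  cum <- outer(-eta, theta, "+")      # theta_j - eta_i for every row i, cut j
  cum[] <- inv.link(cum)              # cumulative probabilities P(Y <= j)
  cum <- cbind(0, cum, 1)
  probs <- cum[, -1, drop = FALSE] - cum[, -ncol(cum), drop = FALSE]
  colnames(probs) <- seq_len(ncol(probs))
  probs
}

# Made-up values standing in for estimates from a fitted model
beta  <- c(x1 = 0.10, x2 = 0.001)
theta <- c(-1.5, -0.8, -0.2, 0.4, 1.0, 1.8)               # 6 cut-points, 7 levels
u_id  <- c(a = -0.3, b = 0.1, c = 0.0, d = 0.4, e = -0.2)  # random intercepts

set.seed(42)
new_dat <- data.frame(
  id = sample(letters[1:5], 5, replace = TRUE),
  x1 = runif(5, 1, 10),
  x2 = runif(5, 100, 1000)
)

X <- as.matrix(new_dat[, c("x1", "x2")])
probs <- predict_probs(X, beta, theta, u = u_id[as.character(new_dat$id)])
round(probs, 3)   # one row per new observation, one column per response level
```

Setting u = 0 would give predictions for an "average" subject, which may be the sensible choice for an id that was not in the training data.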
Using the survey package, I am having issues creating an imputationList that svydesign will accept. Here is a reproducible example:
library(tibble)
library(survey)
library(mitools)
# Data set 1
# Note that I am excluding the "income" variable from the "df"s and creating
# it separately so that it varies between the data sets. This simulates the
# variation with multiple imputation. Since I am using the same seed
# (i.e., 123), all the other variables will be the same, the only one that
# will vary will be "income."
set.seed(123)
df1 <- tibble(id = seq(1, 100, by = 1),
gender = as.factor(rbinom(n = 100, size = 1, prob = 0.50)),
working = as.factor(rbinom(n = 100, size = 1, prob = 0.40)),
pweight = sample(50:500, 100, replace = TRUE))
# Data set 2
set.seed(123)
df2 <- tibble(id = seq(1, 100, by = 1),
gender = as.factor(rbinom(n = 100, size = 1, prob = 0.50)),
working = as.factor(rbinom(n = 100, size = 1, prob = 0.40)),
pweight = sample(50:500, 100, replace = TRUE))
# Data set 3
set.seed(123)
df3 <- tibble(id = seq(1, 100, by = 1),
gender = as.factor(rbinom(n = 100, size = 1, prob = 0.50)),
working = as.factor(rbinom(n = 100, size = 1, prob = 0.40)),
pweight = sample(50:500, 100, replace = TRUE))
# Create list of imputed data sets
impList <- imputationList(df1,
df2,
df3)
# Apply NHIS weights
weights <- svydesign(id = ~id,
weight = ~pweight,
data = impList)
I get the following error:
Error in eval(predvars, data, env) :
numeric 'envir' arg not of length one
To get it to work, I needed to wrap the data sets in list() and pass imputationList directly to svydesign, as follows:
weights <- svydesign(id = ~id,
weight = ~pweight,
data = imputationList(list(df1,
df2,
df3)))
The step-by-step instructions at http://asdfree.com/national-health-interview-survey-nhis.html walk through exactly how to create a multiply-imputed NHIS design, and the analysis examples further down that page include svyglm calls. Avoid using library(data.table) and library(dplyr) together with library(survey).
I am working with survival analysis and the smoothHR package. After modeling, I'd like to plot the relative risk against a variable, which is quite easy with
plot(dataset, predictor)
But I'd like to do it using the ggplot2 package. Any idea how?
#the library
library(smoothHR)
#the artificial dataset
surv.days<- runif(n = 200, min = 100, max = 500)
censor<- sample(c(0,1), 200, replace=TRUE)
surv.var<- surv.days/10 + rnorm(200, mean = 0, sd = 3)
surv.var[which(surv.days>250)]<- surv.days[which(surv.days>250)]/5 + rnorm(length(which(surv.days>250)), mean = 0, sd = 10)
survdata<- data.frame(surv.days, censor, surv.var)
rm(censor, surv.days, surv.var)
#using smoothHR package to adjust a model
variabledf<-dfmacox (time = "surv.days", status = "censor",
nl.predictor = c ("surv.var"),
smoother = "ns",
method = "AIC",
data = survdata)
coxmodel<- coxph(Surv(surv.days, censor) ~ ns(surv.var, variabledf$df[1]), data = survdata, x = TRUE)
c.smoothhr<-smoothHR (data = survdata, coxfit = coxmodel)
After that, I can plot the risk as a function of the survival variable
plot (c.smoothhr, predictor = "surv.var", conf.level = 0.95, ref.label = "", main = "", xlab = "surv.var")
I would like to generate this plot using the ggplot2 package, for storing and customization purposes; but I am simply clueless about how to proceed.
I am struggling with data manipulation in R. My dataset consists of the variables type (5 levels), intensity (3 levels), and damage (continuous). I want to calculate the mean damage (damage1, damage2 and damage3 separately) with respect to intensity and type. In other words, I want to summarize the average damage by type and intensity. I have created this small reproducible example of my data:
type <- sample(seq(from = 1, to = 5, by = 1), size = 50, replace = TRUE)
intensity <- sample(seq(from = 1, to = 3, by = 1), size = 50, replace = TRUE)
damage1 <- sample(seq(from = 1, to = 50, by = 1), size = 50, replace = TRUE)
damage2 <- sample(seq(from = 1, to = 200, by = 1), size = 50, replace = TRUE)
damage3 <- sample(seq(from = 1, to = 500, by = 1), size = 50, replace = TRUE)
dat <- cbind(type, intensity, damage1, damage2, damage3)
Then, to manipulate the data, I used the pipe operator %>%, but my commands do not seem to work:
dat <- as.data.frame(dat)
dat %>%
filter(type == 1) %>%
group_by(intensity, damage) %>%
summarise(mean_damage = mean(Value))
I have read about multiple useful functions here:
efficient reshaping using data tables
manipulating data tables
Do Faster Data Manipulation using These 7 R Packages
But I wasn't able to make any progress. My questions are:
What is wrong with my code?
Am I even going in the right direction here?
Is there some alternative how to do this?
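A possible starting point for the first question: the data frame has no columns named damage or Value, and filter(type == 1) keeps only one type. A minimal sketch in base R, assuming the goal is one mean per damage column for every type and intensity combination:

```r
set.seed(1)
dat <- data.frame(
  type      = sample(1:5, 50, replace = TRUE),
  intensity = sample(1:3, 50, replace = TRUE),
  damage1   = sample(1:50, 50, replace = TRUE),
  damage2   = sample(1:200, 50, replace = TRUE),
  damage3   = sample(1:500, 50, replace = TRUE)
)

# Mean of each damage column within every observed type x intensity cell
means <- aggregate(cbind(damage1, damage2, damage3) ~ type + intensity,
                   data = dat, FUN = mean)
means

# A dplyr equivalent (assuming dplyr >= 1.0 is installed) would be along the lines of:
# dat %>% group_by(type, intensity) %>%
#   summarise(across(starts_with("damage"), mean), .groups = "drop")
```

Combinations of type and intensity that never occur in the data are simply absent from the result.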