I have 5 different survfit() plots of different models that I am trying to combine into one plot in the style of a landmark analysis plot, as seen below.
At the moment they are just plot(survfit(model, newdata = )). How could I combine them so that I have the line for days 0-100 from survfit 1, days 100-200 from survfit 2, etc.?
Let's create a model using the built-in lung data from the survival package:
library(survival)
library(tidyverse)
mod1 <- survfit(Surv(time, status) ~ sex, data = lung)
This model actually contains all we need to make the plot. We can convert it to a data frame as follows:
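# Select components by name; positional indices can shift between survival versions
df <- as.data.frame(unclass(mod1)[c("time", "n.risk", "n.event", "n.censor", "surv", "std.err", "lower", "upper")])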
df$sex <- rep(c("Male", "Female"), times = mod1$strata)
head(df)
#> time n.risk n.event n.censor surv std.err lower upper sex
#> 1 11 138 3 0 0.9782609 0.01268978 0.9542301 1.0000000 Male
#> 2 12 135 1 0 0.9710145 0.01470747 0.9434235 0.9994124 Male
#> 3 13 134 2 0 0.9565217 0.01814885 0.9230952 0.9911586 Male
#> 4 15 132 1 0 0.9492754 0.01967768 0.9133612 0.9866017 Male
#> 5 26 131 1 0 0.9420290 0.02111708 0.9038355 0.9818365 Male
#> 6 30 130 1 0 0.9347826 0.02248469 0.8944820 0.9768989 Male
With a bit of data manipulation, we can define 100-day periods and renormalize the curves at the start of each period. Then we can plot using geom_step
df %>%
  filter(time < 300) %>%
  group_by(sex) %>%
  mutate(period = factor(100 * floor(time / 100))) %>%
  group_by(sex, period) %>%
  mutate(surv = surv / first(surv)) %>%
  ggplot(aes(time, surv, color = sex, group = interaction(period, sex))) +
  geom_step(size = 1) +
  geom_vline(xintercept = c(0, 100, 200, 300), linetype = 2) +
  scale_color_manual(values = c("deepskyblue4", "orange")) +
  theme_minimal(base_size = 16) +
  theme(legend.position = "top")
Created on 2022-09-03 with reprex v2.0.2
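To see what the renormalization step does in isolation, here is a minimal base R sketch on made-up numbers (the toy data frame and the surv_rescaled column are hypothetical, not taken from the lung fit). Each value is divided by the first value observed in its 100-day period, so every segment restarts at 1:

```r
# Toy survival estimates (hypothetical values, for illustration only)
toy <- data.frame(
  time = c(10, 50, 120, 180, 210),
  surv = c(0.95, 0.90, 0.80, 0.72, 0.60)
)

# Assign each time point to a 100-day period: 0, 100, 200, ...
toy$period <- 100 * floor(toy$time / 100)

# Divide each survival value by the first value seen in its period
toy$surv_rescaled <- ave(toy$surv, toy$period, FUN = function(s) s / s[1])

toy
```

Note that this divides by the survival at the first event time inside each period, as the pipeline above does, not by the survival exactly at the period boundary. For a stricter landmark curve you may prefer the latter.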
I have a data set which I uploaded here as a gist in CSV format.
It is the extracted form of the PDFs provided in the YouGov article "How good is 'good'?". People were asked to rate words (e.g. "perfect", "bad") with a score between 0 (very negative) and 10 (very positive). The gist contains exactly that data, i.e. for every word (column: Word) it stores, for every rating from 0 to 10 (column: Category), the number of votes (column: Total).
I would usually try to visualize the data with matplotlib and Python since I lack knowledge in R, but it seems that ggridges can create way nicer plots than I see myself doing with Python.
Using:
library(ggplot2)
library(ggridges)
library(readr) # provides read_csv()
YouGov <- read_csv("https://gist.githubusercontent.com/camminady/2e3aeab04fc3f5d3023ffc17860f0ba4/raw/97161888935c52407b0a377ebc932cc0c1490069/poll.csv")
ggplot(YouGov, aes(x=Category, y=Word, height = Total, group = Word, fill=Word)) +
geom_density_ridges(stat = "identity", scale = 3)
I was able to create this plot (which is still far from perfect):
Ignoring the fact that I have to tweak the aesthetics, there are three things I struggle to do:
Sort the words by their average rank.
Color the ridge by the average rank.
Or color the ridge by the category value, i.e. with varying color.
I tried to adapt the suggestions from this source, but ultimately failed because my data seems to be in the wrong format: Instead of having single instances of votes, I already have the aggregated vote count for each category.
I hope to end up with a result closer to this plot, which satisfies criteria 3 (source):
It took me a little while to get there myself. The key for me was understanding the data and how to order Word based on the average Category score. So let's look at the data first:
> YouGov
# A tibble: 440 x 17
ID Word Category Total Male Female `18 to 35` `35 to 54` `55+`
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 Incr~ 0 0 0 0 0 0 0
2 1 Incr~ 1 1 1 1 1 1 0
3 2 Incr~ 2 0 0 0 0 0 0
4 3 Incr~ 3 1 1 1 1 1 1
5 4 Incr~ 4 1 1 1 1 1 1
6 5 Incr~ 5 5 6 5 6 5 5
7 6 Incr~ 6 6 7 5 5 8 5
8 7 Incr~ 7 9 10 8 10 7 10
9 8 Incr~ 8 15 16 14 13 15 16
10 9 Incr~ 9 20 20 20 22 18 19
# ... with 430 more rows, and 8 more variables: Northeast <dbl>,
# Midwest <dbl>, South <dbl>, West <dbl>, White <dbl>, Black <dbl>,
# Hispanic <dbl>, `Other (NET)` <dbl>
Every Word has a row for every Category (or score, 0-10). The Total provides the number of responses for that Word/Category combination. So although there were no responses where the word "Incredible" scored zero, there is still a row for it.
Before we calculate the average score for each Word, we calculate the product of Category and Total for each Word-Category combination; let's call it the Total Score. From there, we can treat Word as a factor and reorder it by the average Total Score using forcats. After that, you can plot your data just as you did.
library(tidyverse)
library(ggridges)
YouGov <- read_csv("https://gist.githubusercontent.com/camminady/2e3aeab04fc3f5d3023ffc17860f0ba4/raw/97161888935c52407b0a377ebc932cc0c1490069/poll.csv")
YouGov %>%
mutate(total_score = Category*Total) %>%
mutate(Word = fct_reorder(.f = Word, .x = total_score, .fun = mean)) %>%
ggplot(aes(x=Category, y=Word, height = Total, group = Word, fill=Word)) +
geom_density_ridges(stat = "identity", scale = 3)
By treating Word as a factor, we reordered the Words based on their mean Total Score. ggplot also orders the colors accordingly, so we don't have to modify them ourselves, unless you'd prefer a different color palette.
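Ordering by the mean of Category * Total ranks words by their summed score divided by the number of rows per word. If you would rather order by the vote-weighted average score itself, a sketch along these lines could work. It uses base R's weighted.mean() on a made-up toy data frame (the values are hypothetical, not from the poll):

```r
# Toy aggregated votes: one row per Word/Category pair (hypothetical numbers)
toy <- data.frame(
  Word     = rep(c("good", "bad"), each = 3),
  Category = rep(0:2, times = 2),
  Total    = c(1, 2, 7, 6, 3, 1)
)

# Vote-weighted average Category per Word
avg <- sapply(split(toy, toy$Word),
              function(d) weighted.mean(d$Category, d$Total))

# Reorder the factor levels from lowest to highest average
toy$Word <- factor(toy$Word, levels = names(sort(avg)))

avg
levels(toy$Word)
```

When every word has roughly the same number of votes in total, both orderings should agree; they can differ when the totals vary a lot between words.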
The other solution is exactly correct. I just wanted to point out that you can call fct_reorder() from within aes() for an even more compact solution. However, you need to do it twice if you want to change fill color by position along the y axis.
library(tidyverse)
library(ggridges)
YouGov <- read_csv("https://gist.githubusercontent.com/camminady/2e3aeab04fc3f5d3023ffc17860f0ba4/raw/97161888935c52407b0a377ebc932cc0c1490069/poll.csv")
ggplot(YouGov,
aes(
x = Category,
y = fct_reorder(Word, Category*Total, .fun = sum),
height = Total,
fill = fct_reorder(Word, Category*Total, .fun = sum)
)) +
geom_density_ridges(stat = "identity", scale = 3) +
theme(legend.position = "none")
Created on 2020-01-19 by the reprex package (v0.3.0)
If instead you want to color by x position, you can do something like the following. It just doesn't look as nice as the temperature example because the x values come in discrete steps.
library(tidyverse)
library(ggridges)
YouGov <- read_csv("https://gist.githubusercontent.com/camminady/2e3aeab04fc3f5d3023ffc17860f0ba4/raw/97161888935c52407b0a377ebc932cc0c1490069/poll.csv")
ggplot(YouGov,
aes(
x = Category,
y = fct_reorder(Word, Category*Total, .fun = sum),
height = Total,
fill = stat(x)
)) +
geom_density_ridges_gradient(stat = "identity", scale = 3) +
theme(legend.position = "none") +
scale_fill_viridis_c(option = "C")
Created on 2020-01-19 by the reprex package (v0.3.0)
Is there a way to extract the values of the fitted line returned from stat_smooth?
The code I am using looks like this:
p <- ggplot(df1, aes(x=Days, y= Qty,group=Category,color=Category))
p <- p + stat_smooth(method=glm, fullrange=TRUE) + geom_point()
This new R user would greatly appreciate any guidance.
Riffing off of @James's example:
p <- qplot(hp,wt,data=mtcars) + stat_smooth()
You can use the intermediate stages of the ggplot building process to pull out the plotted data. The result of ggplot_build() is a list, one component of which is data, itself a list of data frames containing the computed values to be plotted. In this case, the list holds two data frames, since the original qplot creates one for points and the stat_smooth creates a smoothed one.
> ggplot_build(p)$data[[2]]
geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
x y ymin ymax se PANEL group
1 52.00000 1.993594 1.149150 2.838038 0.4111133 1 1
2 55.58228 2.039986 1.303264 2.776709 0.3586695 1 1
3 59.16456 2.087067 1.443076 2.731058 0.3135236 1 1
4 62.74684 2.134889 1.567662 2.702115 0.2761514 1 1
5 66.32911 2.183533 1.677017 2.690049 0.2465948 1 1
6 69.91139 2.232867 1.771739 2.693995 0.2244980 1 1
7 73.49367 2.282897 1.853241 2.712552 0.2091756 1 1
8 77.07595 2.333626 1.923599 2.743652 0.1996193 1 1
9 80.65823 2.385059 1.985378 2.784740 0.1945828 1 1
10 84.24051 2.437200 2.041282 2.833117 0.1927505 1 1
11 87.82278 2.490053 2.093808 2.886297 0.1929096 1 1
12 91.40506 2.543622 2.145018 2.942225 0.1940582 1 1
13 94.98734 2.597911 2.196466 2.999355 0.1954412 1 1
14 98.56962 2.652852 2.249260 3.056444 0.1964867 1 1
15 102.15190 2.708104 2.303465 3.112744 0.1969967 1 1
16 105.73418 2.764156 2.357927 3.170385 0.1977705 1 1
17 109.31646 2.821771 2.414230 3.229311 0.1984091 1 1
18 112.89873 2.888224 2.478136 3.298312 0.1996493 1 1
19 116.48101 2.968745 2.531045 3.406444 0.2130917 1 1
20 120.06329 3.049545 2.552102 3.546987 0.2421773 1 1
21 123.64557 3.115893 2.573577 3.658208 0.2640235 1 1
22 127.22785 3.156368 2.601664 3.711072 0.2700548 1 1
23 130.81013 3.175495 2.625951 3.725039 0.2675429 1 1
24 134.39241 3.181411 2.645191 3.717631 0.2610560 1 1
25 137.97468 3.182252 2.658993 3.705511 0.2547460 1 1
26 141.55696 3.186155 2.670350 3.701961 0.2511175 1 1
27 145.13924 3.201258 2.687208 3.715308 0.2502626 1 1
28 148.72152 3.235698 2.721744 3.749652 0.2502159 1 1
29 152.30380 3.291766 2.782767 3.800765 0.2478037 1 1
30 155.88608 3.353259 2.857911 3.848607 0.2411575 1 1
31 159.46835 3.418409 2.938257 3.898561 0.2337596 1 1
32 163.05063 3.487074 3.017321 3.956828 0.2286972 1 1
33 166.63291 3.559111 3.092367 4.025855 0.2272319 1 1
34 170.21519 3.634377 3.165426 4.103328 0.2283065 1 1
35 173.79747 3.712729 3.242093 4.183364 0.2291263 1 1
36 177.37975 3.813399 3.347232 4.279565 0.2269509 1 1
37 180.96203 3.910849 3.447572 4.374127 0.2255441 1 1
38 184.54430 3.977051 3.517784 4.436318 0.2235917 1 1
39 188.12658 4.037302 3.583959 4.490645 0.2207076 1 1
40 191.70886 4.091635 3.645111 4.538160 0.2173882 1 1
41 195.29114 4.140082 3.700184 4.579981 0.2141624 1 1
42 198.87342 4.182676 3.748159 4.617192 0.2115424 1 1
43 202.45570 4.219447 3.788162 4.650732 0.2099688 1 1
44 206.03797 4.250429 3.819579 4.681280 0.2097573 1 1
45 209.62025 4.275654 3.842137 4.709171 0.2110556 1 1
46 213.20253 4.295154 3.855951 4.734357 0.2138238 1 1
47 216.78481 4.308961 3.861497 4.756425 0.2178456 1 1
48 220.36709 4.317108 3.859541 4.774675 0.2227644 1 1
49 223.94937 4.319626 3.851025 4.788227 0.2281358 1 1
50 227.53165 4.316548 3.836964 4.796132 0.2334829 1 1
51 231.11392 4.308435 3.818728 4.798143 0.2384117 1 1
52 234.69620 4.302276 3.802201 4.802351 0.2434590 1 1
53 238.27848 4.297902 3.787395 4.808409 0.2485379 1 1
54 241.86076 4.292303 3.772103 4.812503 0.2532567 1 1
55 245.44304 4.282505 3.754087 4.810923 0.2572576 1 1
56 249.02532 4.269040 3.733184 4.804896 0.2608786 1 1
57 252.60759 4.253361 3.710042 4.796680 0.2645121 1 1
58 256.18987 4.235474 3.684476 4.786473 0.2682509 1 1
59 259.77215 4.215385 3.656265 4.774504 0.2722044 1 1
60 263.35443 4.193098 3.625161 4.761036 0.2764974 1 1
61 266.93671 4.168621 3.590884 4.746357 0.2812681 1 1
62 270.51899 4.141957 3.553134 4.730781 0.2866658 1 1
63 274.10127 4.113114 3.511593 4.714635 0.2928472 1 1
64 277.68354 4.082096 3.465939 4.698253 0.2999729 1 1
65 281.26582 4.048910 3.415849 4.681971 0.3082025 1 1
66 284.84810 4.013560 3.361010 4.666109 0.3176905 1 1
67 288.43038 3.976052 3.301132 4.650972 0.3285813 1 1
68 292.01266 3.936392 3.235952 4.636833 0.3410058 1 1
69 295.59494 3.894586 3.165240 4.623932 0.3550782 1 1
70 299.17722 3.850639 3.088806 4.612473 0.3708948 1 1
71 302.75949 3.804557 3.006494 4.602619 0.3885326 1 1
72 306.34177 3.756345 2.918191 4.594499 0.4080510 1 1
73 309.92405 3.706009 2.823813 4.588205 0.4294926 1 1
74 313.50633 3.653554 2.723308 4.583801 0.4528856 1 1
75 317.08861 3.598987 2.616650 4.581325 0.4782460 1 1
76 320.67089 3.542313 2.503829 4.580796 0.5055805 1 1
77 324.25316 3.483536 2.384853 4.582220 0.5348886 1 1
78 327.83544 3.422664 2.259739 4.585589 0.5661643 1 1
79 331.41772 3.359701 2.128512 4.590891 0.5993985 1 1
80 335.00000 3.294654 1.991200 4.598107 0.6345798 1 1
Knowing a priori where the one you want is in the list isn't easy, but if nothing else you can look at the column names.
It is still better to do the smoothing outside the ggplot call, though.
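One way to act on the column-name hint without hard-coding the layer index is sketched below. It assumes the smoothed layer is the only one carrying ymin and ymax columns, which holds for this plot but not necessarily for others; the names layers, has_band and smooth_df are introduced here for illustration:

```r
library(ggplot2)

p <- ggplot(mtcars, aes(hp, wt)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x)

# One data frame of computed values per layer, in layer order
layers <- lapply(seq_along(p$layers), function(i) layer_data(p, i))

# Keep the first layer that carries a confidence band (ymin/ymax columns)
has_band <- vapply(layers,
                   function(d) all(c("ymin", "ymax") %in% names(d)),
                   logical(1))
smooth_df <- layers[[which(has_band)[1]]]

head(smooth_df[, c("x", "y", "ymin", "ymax", "se")])
```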
EDIT:
It turns out replicating what ggplot2 does to make the loess is not as straightforward as I thought, but this will work. I copied it out of some internal functions in ggplot2.
model <- loess(wt ~ hp, data=mtcars)
xrange <- range(mtcars$hp)
xseq <- seq(from=xrange[1], to=xrange[2], length=80)
pred <- predict(model, newdata = data.frame(hp = xseq), se=TRUE)
y = pred$fit
ci <- pred$se.fit * qt(0.95 / 2 + .5, pred$df)
ymin = y - ci
ymax = y + ci
loess.DF <- data.frame(x = xseq, y, ymin, ymax, se = pred$se.fit)
ggplot(mtcars, aes(x=hp, y=wt)) +
geom_point() +
geom_smooth(aes(x = x, y = y, ymin = ymin, ymax = ymax), data=loess.DF, stat="identity")
That gives a plot that looks identical to
ggplot(mtcars, aes(x=hp, y=wt)) +
geom_point() +
geom_smooth()
(which is the expanded form of the original p).
stat_smooth does produce output that you can use elsewhere, and in a slightly hacky way, you can put it into a variable in the global environment.
You enclose the output variable in .. on either side to use it. So if you add an aes in the stat_smooth call and use the global assignment operator, <<-, to assign the output to a variable in the global environment, you can get the fitted values, or others (see below).
qplot(hp,wt,data=mtcars) + stat_smooth(aes(outfit=fit<<-..y..))
fit
[1] 1.993594 2.039986 2.087067 2.134889 2.183533 2.232867 2.282897 2.333626
[9] 2.385059 2.437200 2.490053 2.543622 2.597911 2.652852 2.708104 2.764156
[17] 2.821771 2.888224 2.968745 3.049545 3.115893 3.156368 3.175495 3.181411
[25] 3.182252 3.186155 3.201258 3.235698 3.291766 3.353259 3.418409 3.487074
[33] 3.559111 3.634377 3.712729 3.813399 3.910849 3.977051 4.037302 4.091635
[41] 4.140082 4.182676 4.219447 4.250429 4.275654 4.295154 4.308961 4.317108
[49] 4.319626 4.316548 4.308435 4.302276 4.297902 4.292303 4.282505 4.269040
[57] 4.253361 4.235474 4.215385 4.193098 4.168621 4.141957 4.113114 4.082096
[65] 4.048910 4.013560 3.976052 3.936392 3.894586 3.850639 3.804557 3.756345
[73] 3.706009 3.653554 3.598987 3.542313 3.483536 3.422664 3.359701 3.294654
The outputs you can obtain are:
y, predicted value
ymin, lower pointwise confidence interval around the mean
ymax, upper pointwise confidence interval around the mean
se, standard error
Note that by default it predicts on 80 data points, which may not be aligned with your original data.
A more general approach could be to simply use the predict() function to predict any range of values that are interesting.
# define the model
model <- loess(wt ~ hp, data = mtcars)
# predict fitted values for each observation in the original dataset
modelFit <- data.frame(predict(model, se = TRUE))
# define data frame for ggplot
df <- data.frame(cbind(hp = mtcars$hp
, wt = mtcars$wt
, fit = modelFit$fit
, upperBound = modelFit$fit + 2 * modelFit$se.fit
, lowerBound = modelFit$fit - 2 * modelFit$se.fit
))
# build the plot using the fitted values from the predict() function
# geom_linerange() and the second geom_point() in the code are built using the values from the predict() function
# for comparison ggplot's geom_smooth() is also shown
g <- ggplot(df, aes(hp, wt))
g <- g + geom_point()
g <- g + geom_linerange(aes(ymin = lowerBound, ymax = upperBound))
g <- g + geom_point(aes(hp, fit, size = 1))
g <- g + geom_smooth(method = "loess")
g
# Predict any range of values and include the standard error in the output
predict(model, newdata = 100:300, se = TRUE)
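Building on the predict() call above, here is a sketch that turns the standard errors into a t-based 95% interval at chosen hp values, mirroring the qt() computation used earlier in this thread. The new_hp values and the column names of out are illustrative assumptions:

```r
# Same loess model as above
model <- loess(wt ~ hp, data = mtcars)

# Hypothetical hp values at which we want predictions
new_hp <- data.frame(hp = c(100, 150, 200))

# With se = TRUE, predict() returns fit, se.fit, residual.scale and df
pred <- predict(model, newdata = new_hp, se = TRUE)

# Half-width of a 95% interval based on the t distribution
half_width <- pred$se.fit * qt(0.975, pred$df)

out <- data.frame(
  hp    = new_hp$hp,
  fit   = pred$fit,
  lower = pred$fit - half_width,
  upper = pred$fit + half_width
)
out
```

Passing newdata as a data frame with a named hp column keeps the call unambiguous if the model later gains a second predictor.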
If you want to bring in the power of the tidyverse, you can use the "broom" library to add the predicted values from the loess function to your original dataset. This builds on @phillyooo's solution.
library(tidyverse)
library(broom)
# original graph with smoother
ggplot(data=mtcars, aes(hp,wt)) +
stat_smooth(method = "loess", span = 0.75)
# Create model that will do the same thing as under the hood in ggplot2
model <- loess(wt ~ hp, data = mtcars, span = 0.75)
# Add predicted values from model to original dataset using broom library
mtcars2 <- augment(model, mtcars)
# Plot both lines
ggplot(data=mtcars2, aes(hp,wt)) +
geom_line(aes(hp, .fitted), color = "red") +
stat_smooth(method = "loess", span = 0.75)
Save the graph object and use ggplot_build() or layer_data() to obtain the elements/estimates for the layers. e.g.
pp<-ggplot(mtcars, aes(x=hp, y=wt)) + geom_point() + geom_smooth();
ggplot_build(pp)
Problem Description
I am trying to make a swimmerplot in R using ggplot. However, I encounter a problem when I would like to have 'empty' space between two stacked bars of the plot: the bars are arranged next to one another.
Code & Sample data
I have the following sample data:
# Sample data
df <- read.table(text="patient start keytreat duration
sub-1 0 treat1 3
sub-1 8 treat2 2
sub-1 13 treat3 1.5
sub-2 0 treat1 4.5
sub-3 0 treat1 4
sub-3 4 treat2 8
sub-3 13.5 treat3 2", header=TRUE)
When I use the following code to generate a swimmerplot, I end up with a swimmerplot of 3 subjects. Subject 2 received only 1 treatment (treatment 1), so this displays correctly.
However, subject 1 received 3 treatments: treatment 1 from time point 0 up to time point 3, then nothing from 3 to 8, then treatment 2 from 8 until 10 etc...
The data are plotted so that, for patients 1 and 3, all treatments appear back to back instead of with 'empty' intervals in between.
# Plot: bars
bars <- map(unique(df$patient),
            ~geom_bar(stat = "identity", position = "stack", width = 0.6,
                      data = df %>% filter(patient == .x)))
# Create plot
ggplot(data = df, aes(x = patient,
y = duration,
fill = reorder(keytreat,-start))) +
bars +
guides(fill=guide_legend("ordering")) +
coord_flip()
Question
How do I include empty spaces between two non-consecutive treatments in this swimmerplot?
I don't think geom_bar is the right geom in this case. It's really meant for showing frequencies or counts and you can't explicitly control their start or end coordinates.
geom_segment is probably what you want:
library(tidyverse)
# Sample data
df <- read.table(text="patient start keytreat duration
sub-1 0 treat1 3
sub-1 8 treat2 2
sub-1 13 treat3 1.5
sub-2 0 treat1 4.5
sub-3 0 treat1 4
sub-3 4 treat2 8
sub-3 13.5 treat3 2", header=TRUE)
# Add end of treatment
df_wrangled <- df %>%
mutate(end = start + duration)
ggplot(df_wrangled) +
geom_segment(
aes(x = patient, xend = patient, y = start, yend = end, color = keytreat),
size = 8
) +
coord_flip()
Created on 2019-03-29 by the reprex package (v0.2.1)
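If it helps to check the data before plotting, the gaps themselves can be computed from the same end = start + duration idea. A base R sketch, where gap_after is a hypothetical column name introduced here:

```r
# Same sample data as above, without the treatment labels
df <- data.frame(
  patient  = c("sub-1", "sub-1", "sub-1", "sub-2", "sub-3", "sub-3", "sub-3"),
  start    = c(0, 8, 13, 0, 0, 4, 13.5),
  duration = c(3, 2, 1.5, 4.5, 4, 8, 2)
)

# End of each treatment
df$end <- df$start + df$duration

# Sort so that each patient's treatments are in chronological order
df <- df[order(df$patient, df$start), ]

# Start of the following treatment for the same patient (NA for the last one)
next_start <- ave(df$start, df$patient, FUN = function(s) c(s[-1], NA))

# Untreated time between this treatment and the next
df$gap_after <- next_start - df$end
df
```

A gap of 0 means two treatments are truly consecutive, as for the first two treatments of sub-3, so no empty space should appear between those segments.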
I want to plot percent survival per treatment (percent.o2). When y == 0% for a treatment, I get a fat bar. I'd like them to be the same width. Any advice appreciated.
Data looks like this:
> plotData
# A tibble: 12 x 4
# Groups: Percent.O2 [7]
Status Percent.O2 n percent
<fct> <fct> <int> <dbl>
1 Dead 1 144 1.00
2 Dead 3 141 0.979
3 Dead 7 144 1.00
4 Dead 10 105 0.729
5 Dead 13 69 0.958
6 Dead Control 12 0.167
7 Dead Control2 2 0.0278
8 Still_kicking 3 3 0.0208
9 Still_kicking 10 39 0.271
10 Still_kicking 13 3 0.0417
11 Still_kicking Control 60 0.833
12 Still_kicking Control2 70 0.972
Here's my code for the plot:
> ggplot(plotData, aes(x = Percent.O2, y = percent, fill = Status)) + geom_col(position = "dodge")
You can use the complete function from the tidyr package (which is loaded as part of the tidyverse suite of packages) to add rows for the missing levels and fill them with zero (NA, the default fill value, would work too). I've also added percent labels on the y-axis using percent from the scales package.
library(tidyverse)
library(scales)
ggplot(plotData %>%
         ungroup() %>%  # plotData is grouped by Percent.O2; recent tidyr cannot complete a grouping column
         complete(Status, nesting(Percent.O2), fill=list(n=0, percent=0)),
       aes(x = Percent.O2, y = percent, fill = Status)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels=percent)
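To make the effect of complete() concrete, here is a small sketch on a made-up data frame (toy and its values are hypothetical, not the plotData above). The missing Status/Percent.O2 combination is added as a new row with percent set to 0:

```r
library(tidyr)

# Toy data with one missing combination: no "Alive" row for Percent.O2 == "1"
toy <- data.frame(
  Status     = factor(c("Dead", "Dead", "Alive")),
  Percent.O2 = factor(c("1", "3", "3")),
  percent    = c(1, 0.98, 0.02)
)

# Add a row for every Status x Percent.O2 combination, filling percent with 0
filled <- complete(toy, Status, Percent.O2, fill = list(percent = 0))
filled
```

With a row present for every combination, position = "dodge" has the same number of bars to place in each group, so the bar widths match.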
I have a data set with 20 variables. 10 of them are variables of great interest but these variables need to be adjusted for group differences in terms of age and sex. I do this by using regression, to predict values depending on age and sex.
There are many variables, and many persons, so I want a loop or similar.
Here is an example of what I'm attempting
# Load example data
library(survival)
library(dplyr)
data(lung) # example data
# I want to obtain adjusted values for the following two variables, called "dependents"
dependents <- names(select(lung, 7:8))
new_data <- lung # copies data set
for (i in seq_along(dependents)) {
eq <- paste(dependents[i],"~ age + sex")
fit <- lm(as.formula(eq), data= new_data)
new_data$predicted_value <- predict(fit, newdata=new_data, type='response')
new_data <- rename(new_data, paste(dependents[i], "_predicted", sep="") = predicted_value)
}
View(new_data)
This failed to provide me with the "dependents" in adjusted (i.e predicted) form.
Any ideas?
Thanks in advance
Here is an alternative approach, using the tidyr package and the augment function from my broom package:
library(tidyr)
library(broom)
new_data <- lung %>%
gather(dependent, value, ph.karno:pat.karno) %>%
group_by(dependent) %>%
do(augment(lm(value ~ age + sex, data = .)))
This reorganizes the data so that each dependent (ph.karno and pat.karno) is stacked on top of each other, distinguished by a dependent column. The augment function turns each model into a data frame with columns for fitted values, residuals, and other values you care about (see ?lm_tidiers for more). The .fitted column then gives the fitted values:
new_data
#> Source: local data frame [452 x 12]
#> Groups: dependent
#>
#> dependent .rownames value age sex .fitted .se.fit .resid
#> 1 ph.karno 1 90 74 1 78.86709 1.406553 11.132915
#> 2 ph.karno 2 90 68 1 80.53347 1.115994 9.466530
#> 3 ph.karno 3 90 56 1 83.86624 1.226463 6.133759
#> 4 ph.karno 4 90 57 1 83.58851 1.181024 6.411490
#> 5 ph.karno 5 100 60 1 82.75532 1.078170 17.244683
#> 6 ph.karno 6 50 74 1 78.86709 1.406553 -28.867085
#> 7 ph.karno 7 70 68 2 80.18860 1.419744 -10.188596
#> 8 ph.karno 8 60 71 2 79.35540 1.555365 -19.355404
#> 9 ph.karno 9 70 53 1 84.69943 1.388600 -14.699433
#> 10 ph.karno 10 70 61 1 82.47759 1.056850 -12.477586
#> .. ... ... ... ... ... ... ... ...
#> Variables not shown: .hat (dbl), .sigma (dbl), .cooksd (dbl), .std.resid
#> (dbl)
As one way you could use this data, you could graph how the predictions for the dependent variables differ:
ggplot(new_data, aes(age, .fitted, color = dependent, lty = factor(sex))) +
geom_line()
If you're looking to control for the age and sex, however, you probably want to work with the .resid column.
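As a minimal sketch of the residual idea for a single variable, assuming plain lm() rather than broom: fitting with na.action = na.exclude makes residuals() return NA for rows that were dropped, so the result lines up with the rows of lung. The name adjusted is introduced here for illustration:

```r
library(survival)  # provides the lung data set

# na.exclude drops rows with missing values when fitting,
# but pads residuals() and fitted() with NA so they match the input rows
fit <- lm(ph.karno ~ age + sex, data = lung, na.action = na.exclude)

# Age- and sex-adjusted values: what is left after removing their linear effect
adjusted <- residuals(fit)

length(adjusted) == nrow(lung)
summary(adjusted)
```

Because the vector has one entry per row, it can be assigned directly as a new column, e.g. lung$ph.karno_adjusted <- adjusted.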
Can't you just do this?
dependents <- names(lung)[7:8]
fit <- lm(as.formula(sprintf("cbind(%s) ~ age + sex",
paste(dependents, collapse = ", "))),
data = lung)
predict(fit)
Maybe I'm misunderstanding. Your question isn't very clear.
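For reference, a sketch of what the cbind() approach returns, written out with the two column names instead of building the formula from a string. With a matrix response, predict() gives a matrix with one column per dependent variable:

```r
library(survival)  # provides the lung data set

# Multi-response linear model: one set of coefficients per dependent
fit <- lm(cbind(ph.karno, pat.karno) ~ age + sex, data = lung)

# Fitted values: a matrix with a column for each dependent
pred <- predict(fit)

dim(pred)
colnames(pred)
head(pred)
```

Rows with an NA in either dependent are dropped when fitting, so pred can have fewer rows than lung; its row names identify which rows of lung were kept.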
And a third approach.
new_data <- na.omit(lung[,c("sex","age",dependents)])
result <- lapply(new_data[,dependents],
function(y)predict(lm(y~age+sex,data.frame(y=y,new_data[,c("age","sex")]))))
names(result) <- paste(names(result),"predicted",sep="_")
result <- cbind(new_data,as.data.frame(result))
head(result)
# sex age ph.karno pat.karno ph.karno_predicted pat.karno_predicted
# 1 1 74 90 100 78.83030 77.34670
# 2 1 68 90 90 80.59974 78.53841
# 3 1 56 90 90 84.13862 80.92183
# 4 1 57 90 60 83.84371 80.72321
# 5 1 60 100 90 82.95899 80.12736
# 6 1 74 50 80 78.83030 77.34670
Your original code has a couple of subtle problems (other than the fact that it doesn't run). The response variables have a few NAs, which are removed automatically by lm(...), so the prediction has fewer rows than the original data set, and when you try to add the new column with, e.g.
new_data$predicted_value <- predict(fit, newdata=new_data, type='response')
you get an error. You have to remove the NAs from new_data first, as shown in the code above.
I'm also wondering, since your data seem to be counts of something, whether you should be using a Poisson glm instead of lm?