I'm a new R user and I am trying to chart an interaction between 2 continuous variables and a categorical variable.
Using interaction.plot:
interaction.plot(nonconform, trans, employdisc, type="b", col=(1:3) ,
leg.bty="o", leg.bg="beige", lwd=2, pch=c(18,24,22),
xlab="Nonconformity",
ylab="Discrimination",
main="Interaction Plot")
I get this result:
interaction plot
When I attempt to do the same thing with ggplot
ggplot(data=NTDS.zip, aes(x=nonconform, y=employdisc, colour = factor(trans), group=trans)) +
stat_summary(fun.y=mean, geom="point") +
stat_summary(fun.y=mean, geom="line")
I get this result:
ggplot chart
There is an extra line (in grey) that I can't get rid of. It's likely representing missing data, but I haven't found a way to remove that line from the chart. Any discussion I found talked about suppressing warnings due to missing data, but nothing regarding extra lines in a chart.
Any thoughts?
Update
After reading the R Graphics Cookbook I tried another method.
The book's method involved summarizing the data first.
tg <- ddply(ntds.new, c("trans", "nonconform"), summarize, empdisc=mean(employdisc))
and then plotting the chart.
I tried 2 types (colour and linetype)
ggplot(tg, aes(x=nonconform, y=empdisc, colour=trans))+geom_line()
ggplot(tg, aes(x=nonconform, y=empdisc, linetype=trans))+geom_line()
The plot with the colour statement has the extra line, while the plot with linetype does not.
The data for this was:
trans nonconform empdisc
1 1 0 1.104046
2 1 1 1.472050
3 1 2 1.930070
4 1 3 2.247706
5 1 4 3.407407
6 1 NA 7.250000
7 2 0 3.427230
8 2 1 3.929707
9 2 2 4.062275
10 2 3 4.373853
11 2 4 4.470149
12 2 NA 5.294118
13 3 0 1.309524
14 3 1 1.968310
15 3 2 2.366589
16 3 3 3.815000
17 3 4 3.560606
18 3 NA 6.000000
19 4 0 2.661290
20 4 1 3.208861
21 4 2 3.033195
22 4 3 3.322176
23 4 4 3.755906
24 4 NA 6.625000
25 NA 0 4.000000
26 NA 1 4.166667
27 NA 2 2.500000
28 NA 3 6.666667
29 NA 4 5.400000
30 NA NA 2.000000
I went back and deleted the (10) lines with missing cases for either trans or nonconform columns.
trans nonconform empdisc
1 1 0 1.104046
2 1 1 1.472050
3 1 2 1.930070
4 1 3 2.247706
5 1 4 3.407407
6 2 0 3.427230
7 2 1 3.929707
8 2 2 4.062275
9 2 3 4.373853
10 2 4 4.470149
11 3 0 1.309524
12 3 1 1.968310
13 3 2 2.366589
14 3 3 3.815000
15 3 4 3.560606
16 4 0 2.661290
17 4 1 3.208861
18 4 2 3.033195
19 4 3 3.322176
20 4 4 3.755906
This solved my initial problem but this solution seems more complicated than it should be, and I'm curious as to why the plot with "colour" was affected and the one with "linetype" wasn't.
If you look at your summarized data in tg, there are NA values in the trans variable.
When you map trans (as a factor) to the line colour, those NA values are plotted too, because colour scales draw NA levels in grey50 by default (na.value="grey50"). Linetype scales, by contrast, draw NA levels as a blank line by default (na.value="blank"), so you don't see that line.
There are a couple of ways to solve this. First, you can add scale_color_discrete() and set na.value to NA.
ggplot(tg, aes(x=nonconform, y=empdisc, colour=as.factor(trans)))+
geom_line()+
scale_color_discrete(na.value=NA)
Another solution is to subset your data to remove the NA values and then plot it. This can also be done inside the ggplot() call.
ggplot(tg[complete.cases(tg),], aes(x=nonconform, y=empdisc, colour=as.factor(trans)))+
geom_line()
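If it helps, here is a minimal sketch of the second approach applied before summarizing, so the NA groups never reach the plot. The raw data frame is a made-up stand-in for ntds.new, and base aggregate() is swapped in for ddply() just to keep the example self-contained. The plot step only runs if ggplot2 is installed.

```r
# Hypothetical raw data standing in for ntds.new (values are made up).
raw <- data.frame(
  trans      = c(1, 1, 1, 2, 2, 2, NA, NA),
  nonconform = c(0, 0, 1, 0, 1, NA, 0, 1),
  employdisc = c(1, 2, 3, 4, 5, 6, 7, 8)
)

# Keep only rows where both grouping variables are observed.
keep <- complete.cases(raw[, c("trans", "nonconform")])

# Mean of employdisc for each trans/nonconform combination.
tg <- aggregate(employdisc ~ trans + nonconform, data = raw[keep, ], FUN = mean)
names(tg)[names(tg) == "employdisc"] <- "empdisc"

# A factor gives one discrete colour per group and no NA level.
tg$trans <- factor(tg$trans)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(tg, aes(x = nonconform, y = empdisc, colour = trans)) +
    geom_line()
  print(p)
}
```

Because the NA rows are dropped before the summary, neither the colour nor the linetype version has an NA group to draw.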
Note: as I'm writing this I can't figure out how to insert images, I'll work on it after posting, but if you run the code below, you should be able to see the graphs I'm talking about....sorry!
Essentially, I have these two graphs and I want them to be on the same plot (overlaid on top of one another), but I need them to use different color schemes or I won't be able to tell them apart very easily.
I've looked everywhere on this site and while there are a lot of similar questions, none of them have worked quite in the way that I need them to. The closest ones I've linked below, just know that I've read them and they did not solve my issues:
Distinct color palettes for two different groups in ggplot2
R ggplot two color palette on the same plot
The first graph uses this data (shortened to 50 lines; the full set has about 1000). RuleCount repeats 1-14 over and over, and TrainingPass goes up to about 60.
RuleCount TrainingPass m4Accuracy
1 1 -1 0.000000000
2 2 -1 0.000000000
3 3 -1 0.004225352
4 4 -1 0.014225352
5 5 -1 0.022816901
6 6 -1 0.182957746
7 7 -1 0.194507042
8 8 -1 0.207183099
9 9 -1 0.239859155
10 10 -1 0.362394366
11 11 -1 0.430704225
12 12 -1 0.567887324
13 13 -1 0.582535211
14 14 -1 0.602676056
15 1 0 0.000000000
16 2 0 0.000281690
17 3 0 0.006901408
18 4 0 0.018732394
19 5 0 0.031267606
20 6 0 0.202676056
21 7 0 0.215633803
22 8 0 0.231830986
23 9 0 0.262253521
24 10 0 0.373661972
25 11 0 0.440281690
26 12 0 0.573802817
27 13 0 0.588169014
28 14 0 0.608873239
29 1 1 0.000985915
30 2 1 0.014788732
31 3 1 0.032957746
32 4 1 0.071408451
33 5 1 0.113943662
34 6 1 0.276760563
35 7 1 0.290281690
36 8 1 0.303943662
37 9 1 0.335633803
38 10 1 0.438028169
39 11 1 0.501971831
40 12 1 0.625070423
41 13 1 0.637323944
42 14 1 0.658169014
43 1 2 0.000985915
44 2 2 0.015915493
45 3 2 0.030704225
46 4 2 0.076619718
47 5 2 0.119436620
48 6 2 0.280563380
49 7 2 0.294507042
50 8 2 0.308732394
I graphed it using this code:
ggplot(df_m4, aes(x=RuleCount, y=m4Accuracy, group = TrainingPass, color = TrainingPass)) +
geom_line()+
scale_color_gradient(low = "green", high = "blue")
Resulting in this graph:
m4 Accuracy
The second graph is essentially the same data and code, except rather than getting a bunch of slightly varying lines on the graph, each of the lines ends up being the same line
data:
RuleCount TrainingPass Accuracy
1 1 -1 0.000422535
2 2 -1 0.000422535
3 3 -1 0.002676056
4 4 -1 0.005915493
5 5 -1 0.007746479
6 6 -1 0.053239437
7 7 -1 0.059718310
8 8 -1 0.068309859
9 9 -1 0.099859155
10 10 -1 0.197042254
11 11 -1 0.256197183
12 12 -1 0.421971831
13 13 -1 0.440422535
14 14 -1 0.468028169
15 1 0 0.000422535
16 2 0 0.000422535
17 3 0 0.002676056
18 4 0 0.005915493
19 5 0 0.007746479
20 6 0 0.053239437
21 7 0 0.059718310
22 8 0 0.068309859
23 9 0 0.099859155
24 10 0 0.197042254
25 11 0 0.256197183
26 12 0 0.421971831
27 13 0 0.440422535
28 14 0 0.468028169
29 1 1 0.000422535
30 2 1 0.000422535
31 3 1 0.002676056
32 4 1 0.005915493
33 5 1 0.007746479
34 6 1 0.053239437
35 7 1 0.059718310
36 8 1 0.068309859
37 9 1 0.099859155
38 10 1 0.197042254
39 11 1 0.256197183
40 12 1 0.421971831
41 13 1 0.440422535
42 14 1 0.468028169
43 1 2 0.000422535
44 2 2 0.000422535
45 3 2 0.002676056
46 4 2 0.005915493
47 5 2 0.007746479
48 6 2 0.053239437
49 7 2 0.059718310
50 8 2 0.068309859
code:
ggplot(df_rules_only, aes(x=RuleCount, y=Accuracy, group = TrainingPass, color = TrainingPass)) +
geom_line() +
scale_color_gradient(low = "green", high = "blue")
Resulting in this graph:
rules only Accuracy
I understand how to get the data on to the same graph. By combining my two data frames and using the code below, I can add the 'rules_only' data to the 'm4' graph:
ggplot(df_Training, aes(x=ruleCount, y=m4Accuracy, group = training_pass, color = training_pass)) +
geom_line() +
scale_color_gradient(low = "green", high = "blue")+
geom_line(aes(x=ruleCount, y=rulesOnlyAccuracy))
Resulting in this graph:
both_data_sets
The problem is that the new data blends right in with the old because it has the same color scheme.
At first I tried keeping them in the same data frame and just adding "color = 'orange'" to the last line of the previous code, but that gives me the error: "Error: Discrete value supplied to continuous scale"
Next I split them up into the two data frames you see above and tried to graph them this way:
ggplot(df_m4, aes(x=RuleCount, y=m4Accuracy, group = TrainingPass, color = TrainingPass)) +
geom_line() +
scale_color_gradient(low = "green", high = "blue")+
geom_line(df_rules_only, aes(x=RuleCount, y=Accuracy, color = "orange"))
but I get the error: "Error: mapping must be created by aes()"
Those last two attempts were kind of shots in the dark since I couldn't find anything else to try, but I'm pretty certain R doesn't work that way.
I'd really prefer for answers to use ggplot since other graphs never look quite as good. Just really feel like I've been going about this all wrong and could really use some help! Thank you in advance :)
Very complicated question for a very simple answer. Wanted to move this out of the comments, but @aosmith helped me out. The code below makes my second group of data a different color:
ggplot(df_Training, aes(x=ruleCount, y=m4Accuracy, group = training_pass, color = training_pass)) +
geom_line() +
geom_line(aes(x=ruleCount, y=rulesOnlyAccuracy), color = "orange")
Just have to work on adding a second legend now!
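For anyone who lands here with the two separate data frames instead, here is a rough sketch of the same idea. It addresses both errors from the question: the second data frame has to be passed as data= (the first positional argument of geom_line() is the mapping), and the fixed colour goes outside aes() so it is not treated as a discrete value on the continuous colour scale. The numbers below are made up and much shorter than the real data.

```r
# Made-up miniature versions of the two data frames from the question.
df_m4 <- data.frame(
  RuleCount    = rep(1:3, times = 2),
  TrainingPass = rep(c(-1, 0), each = 3),
  m4Accuracy   = c(0.00, 0.10, 0.30, 0.05, 0.20, 0.40)
)
df_rules_only <- data.frame(
  RuleCount    = rep(1:3, times = 2),
  TrainingPass = rep(c(-1, 0), each = 3),
  Accuracy     = rep(c(0.00, 0.05, 0.15), times = 2)
)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(df_m4, aes(x = RuleCount, y = m4Accuracy,
                         group = TrainingPass, colour = TrainingPass)) +
    geom_line() +
    scale_colour_gradient(low = "green", high = "blue") +
    # Second layer: its own data (named argument), y remapped,
    # and a constant colour set outside aes().
    geom_line(data = df_rules_only, aes(y = Accuracy), colour = "orange")
  print(p)
}
```

The second layer still inherits x and group from the ggplot() call, so df_rules_only needs the RuleCount and TrainingPass columns.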
I have a data frame df with two columns, which are plotted in a scatterplot using ggplot. I have now divided the curve into intervals. The boundary points of the intervals are in a vector r. I want to highlight these points to make the intervals easier to see. I was thinking about coloring these boundary points, or even marking the intervals by adding vertical lines to the plot. I have tried some commands, but they didn't work for me.
Here is an idea of what my data frame looks like:
d is the first column; e is the second, with the number of instances.
d e
1 4
2 4
3 5
4 5
5 5
6 4
7 2
8 3
9 1
10 3
11 2
12 3
13 3
14 3
15 3
16 3
17 3
18 4
My vector r shows where the interval borders were set.
7
8
9
10
11
12
18
Any ideas how to do so? Thanks!
You can try the tidyverse. The idea is to find overlapping points using `mutate` and `%in%`, then color by the resulting logical vector `gr`. I also added vertical lines to illustrate the "intervals".
library(tidyverse)
d %>%
mutate(gr=d %in% r) %>%
ggplot(aes(d,e, color=gr)) +
geom_vline(xintercept=r, alpha=.1) +
geom_point()
Edit: Without tidyverse you can add gr using
d$gr <- d$d %in% r
ggplot(d, aes(d,e, color=gr)) ...
The data
d <- read.table(text=" d e
1 4
2 4
3 5
4 5
5 5
6 4
7 2
8 3
9 1
10 3
11 2
12 3
13 3
14 3
15 3
16 3
17 3
18 4", header=T)
r <- c(7:12,18)
I am trying to create a scatterpie plot with the scatterpie package in R. My data looks something like this
EEE Innovation n equal negative positive n_mod
0 0 2 NA 2 NA 0.3162278
0 1 6 4 2 NA 0.5477226
0 2 1 NA 1 NA 0.2236068
0 3 2 NA 2 NA 0.3162278
0 5 1 1 NA NA 0.2236068
1 0 4 2 1 1 0.4472136
1 1 14 4 5 5 0.5916080
1 2 9 3 2 4 0.4743416
1 3 1 NA 1 NA 0.1581139
1 5 1 NA 1 NA 0.1581139
2 1 3 NA 2 1 0.2738613
3 0 1 NA 1 NA 0.1581139
3 1 3 1 2 NA 0.2738613
3 2 4 NA 2 2 0.3162278
4 0 3 2 1 NA 0.2738613
4 1 14 5 3 6 0.5916080
4 2 14 4 NA 10 0.5916080
For creating my plot I use this command:
ggplot() +
geom_scatterpie(aes(x=EEE,y=Innovation, r = n_mod), data=pie_data,
cols=c("equal","negative","positive")) +
geom_scatterpie_legend((all_pie_data$n_mod), n=7,
labeller= function(x) x=sort(unique(pie_data$n)))
I use n_mod which I got with
for (l in 1:17) {
all_pie_data$n_mod[l] <- sqrt(all_pie_data$n[l]/40)
}
instead of n as radius because the radii of the pies would be too large for my graph and smaller pies would be buried under the larger ones. For the legend I want to have the radii of the n_mod, but with the label of the "real" n values.
When I try to create this plot, I get the following error message:
Error in $<-.data.frame(*tmp*, "label", value = c(1L, 2L, 3L, 4L, :
replacement has 7 rows, data has 5
This error does not show up if I use anything lower than 24 in my n_mod creation:
for (l in 1:17) {
all_pie_data$n_mod[l] <- sqrt(all_pie_data$n[l]/24)
}
The pies generated by this are still too large for my graphs:
Does anyone have an idea how I can solve this problem or another way to create smaller pies?
P.S: This is my first question here, if I did something wrong with the formatting or any information is missing I am willing to improve!
You could set "r" to:
r = n_mod/2
This should make them look smaller.
ggplot() +
geom_scatterpie(aes(x=EEE,y=Innovation, r = n_mod/2), data=pie_data,
cols=c("equal","negative","positive")) +
geom_scatterpie_legend((all_pie_data$n_mod), n=7,
labeller= function(x) x=sort(unique(pie_data$n)))
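On the legend labels: the labeller in the question ignores its input and always returns the 7 unique n values, which would fit the "replacement has 7 rows, data has 5" message if the legend only draws 5 radii. A small sketch of a labeller that instead converts each radius back to n is below. The names scale_divisor and radius_to_n are made up for illustration, and it assumes the labeller is called with the vector of radii shown in the legend.

```r
# Divisor used to shrink the radii (40, as in the question's loop).
scale_divisor <- 40

# Example counts; the transform is vectorised, so no for loop is needed.
n     <- c(1, 2, 4, 9, 14)
n_mod <- sqrt(n / scale_divisor)

# Inverse of the transform: turns a radius back into the original count.
# It returns exactly one label per radius it is given.
radius_to_n <- function(r) round(scale_divisor * r^2)

radius_to_n(n_mod)
```

A function like this could then be passed as labeller = radius_to_n, and if the radius is halved as suggested above, the inverse would need the matching factor.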
I have the following data from a tree analysis:
train = sample(1:nrow(dd),1010)
yhat1 <- predict(tree.model1,newdata=dd[-train,])
v10.test <- dd$v10[-train]
dd is my data.frame, v10 is the (discrete) response variable that varies between 1 and 10, and train is a sample drawn from my dataframe.
I want to plot the predictions yhat1 with the actual test values v10.test, with the point size taking into account the number of actual test.values that are assigned to that yhat1 as prediction.
Thus:
plot(yhat1, v10.test, cex = ???)
The values for cex that I need can be drawn from the table object, but I don't know how. Any ideas?
table(yhat1, v10.test)
v10.test
yhat1 0 1 2 3 4 5 6 7 8 9 10
2.99479166666667 17 26 7 21 10 8 7 7 8 3 6
4.36725663716814 8 15 21 14 14 14 13 12 4 5 4
4.75 1 1 3 1 0 2 2 2 1 1 0
4.82710280373832 6 10 5 11 7 11 11 18 22 3 2
5.73684210526316 1 5 1 9 7 13 10 7 12 7 12
6.68 0 1 0 1 0 3 1 1 0 0 1
6.92045454545455 0 2 3 2 5 5 4 7 6 9 6
The symbols function may be preferable to using plot and cex when you want the size of points to depend on an additional variable. Note that you will generally get the best representation when using the square root of the variable to determine size (so that the area is proportional).
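A minimal sketch of that idea, with made-up predictions and observed values standing in for yhat1 and v10.test: the table is turned into one row per (prediction, observed) pair, and the square root of the count sets the circle radius.

```r
# Hypothetical predictions and observed values (made up for illustration).
yhat1    <- c(3.0, 3.0, 3.0, 4.4, 4.4, 5.7)
v10.test <- c(1,   1,   2,   5,   5,   9)

# One row per (prediction, observed) combination, with its frequency.
counts <- as.data.frame(table(yhat1 = yhat1, v10.test = v10.test),
                        stringsAsFactors = FALSE)
counts <- counts[counts$Freq > 0, ]          # drop combinations that never occur
counts$yhat1    <- as.numeric(counts$yhat1)  # table labels come back as text
counts$v10.test <- as.numeric(counts$v10.test)

# Circle radius proportional to sqrt(count), so area is proportional to count.
symbols(counts$yhat1, counts$v10.test,
        circles = sqrt(counts$Freq), inches = 0.2,
        xlab = "Predicted", ylab = "Observed")
```

The inches argument only sets the size of the largest circle, so it can be tuned until the plot is readable.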
I played around a bit more and it turns out my main problem was not with the table but with the standard settings for pch and the standard size of the points, which made the resulting graph impossible to interpret.
So a way of doing it simply is
plot(yhat1, v10.test, pch = 20, cex = table(yhat1,v10.test)/10)
That does the trick (and shows how poor the data fit is)
I'm trying to create a 3d scatter plot using the following script:
d <- read.table(file='myfile.dat', header=F)
plot3d(
d,
xlim=c(0,20),
ylim=c(0,20),
zlim=c(0,10000),
xlab='Frequency',
ylab='Size',
zlab='Number of subgraphs',
box=F,
type='s',
size=0.5,
col=d[,1]
)
lines3d(
d,
xlim=c(2,20),
ylim=c(0,20),
zlim=c(0,10000),
lwd=2,
col=d[,1]
)
grid3d(side=c('x', 'y+', 'z'))
Now for some reason, R is ignoring the range limits I've specified and is using arbitrary values, messing up my plot. I get no error when I run the script. Does anybody have any idea what's wrong? If required, I can also post an image of the plot that is created. The data file is given below:
myfile.dat
11 2 2
NA NA NA
10 2 2
NA NA NA
13 2 1
NA NA NA
15 2 1
NA NA NA
5 2 11
5 3 10
5 4 16
5 5 34
5 6 102
5 7 294
5 8 682
5 9 1439
5 10 2646
5 11 3615
5 12 2844
5 13 1394
NA NA NA
4 2 10
4 3 4
4 4 4
4 5 10
4 6 38
4 7 132
4 8 396
4 9 976
4 10 2121
4 11 4085
4 12 6261
4 13 6459
4 14 4238
4 15 1394
NA NA NA
7 2 3
NA NA NA
6 2 2
NA NA NA
9 2 8
9 3 6
9 4 4
9 5 5
NA NA NA
8 2 4
8 3 10
8 4 22
8 5 52
8 6 126
8 7 264
8 8 478
8 9 729
8 10 943
8 11 754
8 12 382
NA NA NA
The help page, ?plot3d says "Note that since rgl does not currently support clipping, all points will be plotted, and 'xlim', 'ylim', and 'zlim' will only be used to increase the respective ranges." So you need to restrict the data in the input stage. (And you will need to use segments3d instead of lines3d if you only want particular ranges that are inside the plotted volume.)
d2 <- subset(d, d[,1]>0 & d[,1]<20 & d[,2]>0 & d[,2]<20 & d[,3]>0 & d[,3]<10000)
plot3d(
d2[, 1:3], # You can probably use something more meaningful,
xlim=c(0,20),
ylim=c(0,20),
zlim=c(0,10000),
xlab='Frequency',
ylab='Size',
zlab='Number of subgraphs',
box=F,
type='s',
size=0.5,
col=d2[,1]
)
(I did notice that when the range was c(0,10000), the points were pretty much invisible at that size. Further experimentation suggests that the great disparity in ranges is going to cause further difficulties in keeping the ranges at 0 on the low side if you increase the size to the point where the points are visible. If you make the points really big, they expand the range to accommodate the overlap beyond the x=0 or y=0 planes.)
As DWin said, lines3d does not handle *lim arguments. From the help page, "... Material properties (see rgl.material), normals and texture coordinates (see rgl.primitive)."
So use some other function, or perhaps you could retrieve the existing limits from your plot3d call and use those to scale your data prior to plotting?
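As a rough illustration of restricting the data yourself before plotting, here is a base R sketch that keeps only rows lying inside a set of axis limits. The data frame and the lims list are made up for the example, and no rgl call is made, so it runs without the package.

```r
# Made-up data in the same three-column layout as myfile.dat.
d <- data.frame(
  V1 = c(5, 5, NA, 25, 8),
  V2 = c(2, 3, NA, 4, 30),
  V3 = c(11, 10, NA, 7, 20000)
)

# Desired limits for the x, y and z columns (hypothetical helper list).
lims <- list(c(0, 20), c(0, 20), c(0, 10000))

# TRUE only for rows where every coordinate is present and within its limits.
inside <- rep(TRUE, nrow(d))
for (j in 1:3) {
  inside <- inside & !is.na(d[[j]]) &
    d[[j]] >= lims[[j]][1] & d[[j]] <= lims[[j]][2]
}

d2 <- d[inside, ]
d2
# d2 (and a colour vector taken from d2, not d) can then be passed to plot3d().
```

Note that this also drops the all-NA rows, so if those rows were meant as separators between line segments they would need to be handled separately.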