How to plot a large ctree() to avoid overlapping nodes - r

When I plotted the decision tree from ctree() in the party package, the font and the boxes were too big, so the nodes overlap each other.
Is there a way to customize the output of plot() so that the boxes and the font are smaller?

The short answer seems to be no: you cannot change the font size in the party plot. There are some good alternatives, though.
I know of three possible solutions. First, you can change other parameters in the plot to make it more compact. Second, you can write it to a graphic file and view that file. Third, you can use an alternative implementation of ctree() in the partykit package, which is a newer package by some of the same authors.
Default Plot Example
library(party)
airq <- subset(airquality, !is.na(Ozone))
airct <- ctree(Ozone ~ ., data = airq,
controls = ctree_control(maxsurrogate = 3))
plot(airct) #default plot, some crowding with N hidden on leafs
Simplified plot
# simpler version of plot
plot(airct, type="simple", # no terminal plots
inner_panel=node_inner(airct,
abbreviate = TRUE, # short variable names
pval = FALSE, # no p-values
id = FALSE), # no id of node
terminal_panel=node_terminal(airct,
abbreviate = TRUE,
digits = 1, # few digits on numbers
fill = c("white"), # make box white not grey
id = FALSE)
)
This is somewhat better and one might be able to improve it further. To figure out these details, I originally did class(airct) which returned "BinaryTree". Armed with this info, I started reading ?plot.BinaryTree
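The lookup described above (find the class, then find the plot method for that class) can be sketched in base R. This is only an illustration: it uses an lm fit as a stand-in so that it runs without party installed, and the variable names are made up for the example.

```r
# Stand-in object; with party loaded you would use the ctree() result instead
fit <- lm(dist ~ speed, data = cars)

# Step 1: find the class of the fitted object
cls <- class(fit)

# Step 2: look up the S3 plot method registered for that class;
# optional = TRUE returns NULL instead of failing when no method exists
plot_method <- getS3method("plot", cls[1], optional = TRUE)

# The matching help page is then ?plot.<class>, e.g. ?plot.lm here
help_topic <- paste0("plot.", cls[1])
```

For the tree in this answer the same steps lead from class(airct) to ?plot.BinaryTree.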
Write to a file
A second simple solution is to write the plot to a file and then view the file. You may need to play with the settings to find the best fit.
png("airct.png", res=80, height=800, width=1600)
plot(airct)
dev.off()
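If you find yourself trying many sizes, the png()/dev.off() pair can be wrapped in a small helper. This is a minimal sketch: save_wide_png is a hypothetical name, the default sizes are simply the ones used above, and a base plot stands in for the tree so the example runs without party.

```r
# Hypothetical helper: draw something into a PNG of a given size
save_wide_png <- function(file, draw, width = 1600, height = 800, res = 80) {
  png(file, width = width, height = height, res = res)
  on.exit(dev.off())   # close the device even if draw() fails
  draw()
  invisible(file)
}

# Stand-in plot; for the tree above you would pass function() plot(airct)
out_file <- save_wide_png(tempfile(fileext = ".png"), function() plot(1:10))
```

The on.exit() call matters when experimenting: if plotting fails, the device is still closed and later plots do not end up in a stale file.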
Plot with partykit package instead
Finally, you can use a newer and not-yet-finished re-implementation of the party package by some of the same authors. At this point (Dec 2012), the only function they have re-done is ctree(). This version allows you to change font size.
library(partykit)
airct <- ctree(Ozone ~ ., data = airq)
class(airct) # different class from before
# "constparty" "party"
plot(airct, gp = gpar(fontsize = 6), # font size changed to 6
inner_panel=node_inner,
ip_args=list(
abbreviate = TRUE,
id = FALSE)
)
Here I have left the leaves in their default setting because I have frankly never figured out how to get them to work the way I want. I suspect this has to do with the fact that the package is incomplete (as of Dec 2012). You can read about the plot method starting with ?plot.party

Another option (which doesn't change the font or boxes, but may solve the underlying problem) is to change the size of the figure itself, as I learned in a class assignment.
In an R Markdown document, replace the chunk header:
{r}
with:
{r, fig.width=X, fig.height=Y}
where X and Y are numbers you choose depending on what size works best.
This website talks about doing this in more detail, including how to set it for the whole document.
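For the document-wide variant, knitr lets you set chunk defaults once. A minimal sketch, assuming the document is rendered with knitr; the numbers are placeholders, and the line would normally go in a setup chunk near the top of the document:

```r
# Set default figure size (in inches) for every chunk that follows;
# individual chunks can still override it in their own header
knitr::opts_chunk$set(fig.width = 12, fig.height = 8)
```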

Related

Using multiple datasets for one graph

I have 2 csv data files. Each file has a "date_time" column and a "temp_c" column. I want to make the x-axis have the "date_time" from both files and then use 2 y-axes to display each "temp_c" with separate lines. I would like to use plot instead of ggplot2 if possible. I haven't been able to find any code help that works with my data and I'm not sure where to really begin. I know how to do 2 separate plots for these 2 datasets, just not combine them into one graph.
plot(grewl$temp_c ~ grewl$date_time)
and
plot(kbll$temp_c ~ kbll$date_time)
work separately but not together.
As others indicated, it is easy to add new data to a graph using points() or lines(). One thing to be careful about is how you format the axes as they will not be automatically adjusted to fit any new data you input using points() and the like.
I've included a small example below that you can copy, paste, run, and examine. Pay attention to why the first plot fails to produce what you want (axes are bad). Also note how I set this example up generally - by making fake data that showcase the same "problem" you are having. Doing this is often a better strategy than simply pasting in your data since it forces you to think about the core component of the problem you are facing.
#for same result each time
set.seed(1234)
#make data
set1<-data.frame("date1" = seq(1,10),
"temp1" = rnorm(10))
set2<-data.frame("date2" = seq(8,17),
"temp2" = rnorm(10, 1, 1))
#first attempt fails
#plot one
plot(set1$date1, set1$temp1, type = "b")
#add points - oops only three showed up bc the axes are all wrong
lines(set2$date2, set2$temp2, type = "b")
#second attempt
#adjust axes to fit everything (set to min and max of either dataset)
plot(set1$date1, set1$temp1,
xlim = c(min(set1$date1,set2$date2),max(set1$date1,set2$date2)),
ylim = c(min(set1$temp1,set2$temp2),max(set1$temp1,set2$temp2)),
type = "b")
#now add the other points
lines(set2$date2, set2$temp2, type = "b")
# we can even add regression lines
abline(reg = lm(set1$temp1 ~ set1$date1))
abline(reg = lm(set2$temp2 ~ set2$date2))
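The question also asked for a separate y-axis per dataset. One common base R approach is to overlay a second plot with par(new = TRUE) and draw its axis on the right. The following is a sketch using the same fake data as above; the colours and axis labels are arbitrary choices.

```r
set.seed(1234)
set1 <- data.frame(date1 = seq(1, 10), temp1 = rnorm(10))
set2 <- data.frame(date2 = seq(8, 17), temp2 = rnorm(10, 1, 1))

# Shared x-axis limits covering both datasets
xlim <- range(set1$date1, set2$date2)

op <- par(mar = c(5, 4, 4, 4) + 0.1)   # extra room on the right for the second axis
plot(set1$date1, set1$temp1, type = "b", xlim = xlim,
     xlab = "date", ylab = "temp1")
par(new = TRUE)                         # draw the next plot on top of the first
plot(set2$date2, set2$temp2, type = "b", col = "red", xlim = xlim,
     axes = FALSE, xlab = "", ylab = "")
axis(side = 4, col.axis = "red")        # right-hand axis for the second series
mtext("temp2", side = 4, line = 3, col = "red")
par(op)                                 # restore the previous margins
```

Both plot() calls must use the same xlim, otherwise the two series will not line up on the x-axis.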

Advise a Chemist: Automate/Streamline his Voltammetry Data Graphing Code

I am a chemist dealing with a significant amount of voltammetry data recently. Let me be very clear and give some research information. I run scans from a starting voltage to an ending voltage on solid state conductive films. These scans are saved as .txt files (name scheme: run#.txt) in a single folder. I am looking at how conductance changes as temperature changes. The LINEST line plotting current v. voltage at a given temperature gives me a line with slope = conductance. Once I have the conductances (slopes) for each scan, I plot conductance v. temperature to see the temperature dependent conductance characteristics. I had been doing this in Excel, but have found quicker ways to get the job done using R. I am brand new to R (Rstudio) and recognize that my coding is not the best. Without doubt, this process can be streamlined and sped up which would help immensely. This is how I am performing the process currently:
# Set working directory with folder containing all .txt files for inspection
# Add all .txt files to the global environment
allruns<-list.files(pattern=".txt")
for(i in 1:length(allruns))assign(allruns[i],read.table(allruns[i]))
Since the voltage column (a 1x1000 matrix) is the same for all runs and is in column V1 of each .txt file, I assign x to be the voltage column from the first file
x<-run1.txt$V1
All currents (these change as voltage changes) are found in the V2 column of all the .txt files, so I assign y# to each. These are entered one at a time.
y1<-run1.txt$V2
y2<-run2.txt$V2
y3<-run3.txt$V2
# ...
yn<-runn.txt$V2
Next I fit a line to each scan so that I can get the equation for each LINEST (one LINEST per scan, plotted with abline later). Again, these are entered one at a time:
run1<-lm(y1~x)
run2<-lm(y2~x)
run3<-lm(y3~x)
# ...
runn<-lm(yn~x)
To obtain a single graph with all LINEST (one for each scan ) on the same plot, without the data points showing up, I have been using this pattern of coding to first get all data points on a single plot in separate series:
plot(x,y1,col="transparent",main="LSV Solid Film", xlab = "potential(V)",ylab="current(A)", xlim=rev(range(x)),ylim=range(c(y3,yn)))
par(new=TRUE)
plot(x,y2,col="transparent",main="LSV Solid Film", xlab = "potential(V)",ylab="current(A)", xlim=rev(range(x)),ylim=range(c(y3,yn)))
par(new=TRUE)
plot(x,y3,col="transparent",main="LSV Solid Film", xlab = "potential(V)",ylab="current(A)", xlim=rev(range(x)),ylim=range(c(y1,yn)))
# ...
par(new=TRUE)
plot(x,yn,col="transparent",main="LSV Solid Film", xlab = "potential(V)",ylab="current(A)", xlim=rev(range(x)),ylim=range(c(y1,yn)))
#To obtain all LINEST lines (one for each scan, on the single graph):
abline(run1,col="", lwd=1)
abline(run2,col="",lwd=1)
abline(run3,col="",lwd=1)
# ...
abline(runn,col="",lwd=1)
# Then to get each LINEST equation:
summary(run1)
summary(run2)
summary(run3)
# ...
summary(runn)
Each time I use summary(), I copy the slope and paste it into an Excel sheet, along with the corresponding scan temperature, which I have recorded separately. I then graph the conductance v. temperature points for the film as an X-Y scatter with smooth lines to give the temperature-dependent conductance curve. This gives me a single plot of the LINEST lines in R and the conductance v. temperature plot in Excel.
This technique is actually MUCH quicker than doing it all in Excel, but I am sure it can be done more quickly and efficiently. Also, if I need to change something, the entire process has to be re-executed with whatever change is necessary. The process takes me maybe 5 hours in Excel and 1.5 hours in R (maybe I am too slow). Nonetheless, any tips to help automate/streamline this further are greatly appreciated.
There are plenty of questions about operating on data in lists; storing a list of matrix or a list of data.frame is fast, and code that operates cleanly on one can be applied to the remaining n-1 very easily.
(Note: the way I'm showing it here is one technique: maintaining everything in well-compartmentalized lists. Others will suggest, very justifiably, that combining things into a single data.frame and adding a group variable (to identify which file/experiment the data came from) will help with more advanced multi-experiment regression or combined plotting, such as with ggplot2. I'm not going to go into this latter technique here, not yet.)
It has long been advised not to do for(...) assign(..., read.csv(...)); you have the important part done, so this is relatively easy:
allruns <- sapply(list.files(pattern = "\\.txt$"), read.table, simplify = FALSE) # pattern is a regular expression, not a glob
(The use of sapply(..., simplify=FALSE) is similar to lapply(...), but it has a nice side-effect of naming the individual list-ified elements with, in this case, each filename. It may not be critical here but is quite handy elsewhere.)
Extracting your invariant and variable data is simple enough:
allLMs <- lapply(allruns, function(mdl) lm(V2 ~ V1, data = mdl))
I'm using each table's V1 here instead of a once-extracted x. Though you might wonder why, I argue for keeping it like this for two reasons: (1) JUST IN CASE the V1 variable is ever even one row different, this will save you; (2) it is very easy to construct the model like this.
At this point, each object within allLMs is an lm object, meaning we might do:
summary(allLMs[[1]])
Plotting: I think I understand why you are using par(new=TRUE), and I have to laugh ... I had been deep in R for a while before I started using that technique. What I think you need is actually much simpler:
xlim <- rev(range(allruns[[1]]$V1))
ylim <- range(sapply(allruns, `[`, "V2"))
# this next plot just sets the box and axes, no points
plot(NA, type = "n", xlim = xlim, ylim = ylim)
# no need to plot points with "transparent" ...
ign <- sapply(allLMs, abline, col = "") # and other abline options ...
Copying all models into Excel, again, using lists:
out <- do.call(rbind, lapply(allLMs, function(m) summary(m)$coefficients[, 1]))
This will now be a single matrix with one row per file and the two coefficients (intercept and slope) in two columns. (Feel free to use similar techniques to extract the other model summary attributes, including std err, t.value, or Pr(>|t|) (in the $coefficients); or $r.squared, $adj.r.squared, etc.)
write.table(out, file = "clipboard", sep = "\t", col.names = NA) # "clipboard" as a file name works on Windows
and paste into Excel. (Or, better yet, save it to a CSV file and import that, since you might want to keep it around.)
One of the tricks to using lists for this is to persevere: keep things in lists as long as you can, so that you don't have to deal with models individually. One mantra is that if you do it once, you shouldn't have to type it again, just loop/apply/map/whatever. Don't extract too much from the lists before you have to.
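Putting the pieces together, the whole workflow (read all files, fit one line per scan, collect the slopes, plot conductance against temperature) can stay in R. The following is a sketch on simulated data, not your actual files: the temperatures, the voltage range and the temporary folder are all made up, and it assumes the temperatures are listed in the same order as the sorted file names.

```r
# Simulated stand-ins for the run#.txt files (all values are invented)
run_dir <- tempfile("runs")
dir.create(run_dir)
temps <- c(300, 310, 320)                       # hypothetical scan temperatures
voltage <- seq(-1, 1, length.out = 50)
for (i in seq_along(temps)) {
  current <- (temps[i] / 1000) * voltage        # exact line, slope = conductance
  write.table(data.frame(voltage, current),
              file.path(run_dir, paste0("run", i, ".txt")),
              row.names = FALSE, col.names = FALSE)
}

# Read every file into a named list, then fit one model per file
files <- list.files(run_dir, pattern = "\\.txt$", full.names = TRUE)
allruns <- sapply(files, read.table, simplify = FALSE)
allLMs <- lapply(allruns, function(d) lm(V2 ~ V1, data = d))

# Pull out the slope of each fit and pair it with its temperature
slopes <- vapply(allLMs, function(m) unname(coef(m)["V1"]), numeric(1))
result <- data.frame(file = basename(files), temp = temps,
                     conductance = unname(slopes))

# Conductance v. temperature, without the round trip through Excel
plot(result$temp, result$conductance, type = "b",
     xlab = "temperature", ylab = "conductance")
```

Note that list.files() sorts names alphabetically, so run10.txt would come before run2.txt; with more than nine runs the pairing with temperatures needs an explicit lookup rather than relying on position.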
Note: r2evans' answer provides good general advice and doesn't require heavy package dependencies. But it probably doesn't hurt to see alternative strategies.
The tidyverse can be quite handy for this sort of thing. Here's a dummy example for illustration:
library(tidyverse)
# creating dummy data files
dummy <- function(T) {
V <- seq(-5, 5, length=20)
I <- jitter(T*V + T, factor = 1)
write.table(data.frame(V=V, I = I),
file = paste0(T,".txt"),
row.names = FALSE)
}
purrr::walk(300:320, dummy)
# reading
lf <- list.files(pattern = "\\.txt")
read_one <- function(f, ...) {cbind(T = as.numeric(gsub("\\.txt", "", f)), read.table(f, ...))}
m <- purrr::map_df(lf, read_one, header = TRUE, .id="id")
head(m)
ggplot(m, aes(V, I, group = T)) +
facet_wrap( ~ T) +
geom_point() +
geom_smooth(se = FALSE)
models <- m %>%
split(.$T) %>%
map(~lm(I ~ V, data = .))
coefs <- models %>% map_df(broom::tidy, .id = "T")
ggplot(coefs, aes(as.numeric(T), estimate)) +
geom_line() +
facet_wrap(~term, scales = "free")

Mosaic plot and text values

I created a structable from the Titanic dataset and used the mosaic function on it. Everything worked great; however, I also wanted to label each box of the mosaic plot with the number of Titanic passengers for the given Class, Survival and Sex. As it turns out, I am not able to do that. I know I need to use labeling_cells to achieve it, but I am not able to use it (and I wasn't able to find any example) in combination with structable and the code below.
library("vcd")
struct <- structable(~ Class + Survived + Sex, data = Titanic)
mosaic(struct, data = Titanic, shade = TRUE, direction = "v")
If I understand your question correctly, then the last example in ?labeling_cells is pretty close to what you want to do. Using your example, the labeling_cells() can be added afterwards provided that the viewport tree is not popped. The only aspect that is somewhat awkward is that the struct object has to be a regular table again for the labeling. I have to ask David, the main author, whether this could be handled automatically.
mosaic(struct, shade = TRUE, direction = "v", pop = FALSE)
labeling_cells(text = as.table(struct), margin = 0)(as.table(struct))
This was fixed upstream in vcd 1.4-4; note that you can now simply use
mosaic(struct, labeling = labeling_values)

manipulate text edge labels ctree

I have a ctree with four labels, but the category names are long, so only the first one is shown.
category;presence;ratio;tested;located
palindromic_recursion;1;0;0;0
conceptual_comprehension;0;1;0;0
infoxication_syndrome;0;0;1;0
foreign_words_abuse;0;0;0;1
palindromic_recursion;1;0;0;0
conceptual_comprehension;0;1;0;0
infoxication_syndrome;0;0;1;0
foreign_words_abuse;0;0;0;1
concepts.ctree <- ctree(category ~., data)
plot(concepts.ctree)
Is there any way or parameter for manipulating (rotating) the text of the edge labels, so that all of them are shown in the plot?
My real data is much bigger, but this sample is enough to test with as long as you do not use the zoom tool.
Regards
There wasn't an option for this up to now. But I just tweaked the development version of partykit on R-Forge to support this feature. Currently, the package is re-building but hopefully you can soon say install.packages("partykit", repos = "http://R-Forge.R-project.org") - or if you don't want to wait that long, simply check out the SVN and re-build yourself.
In the new version, you can pass the rot and just arguments to grid.text() to control rotation and justification of the x-axis labels.
Read the data:
data <- read.csv2(textConnection(
"category;presence;ratio;tested;located
palindromic_recursion;1;0;0;0
conceptual_comprehension;0;1;0;0
infoxication_syndrome;0;0;1;0
foreign_words_abuse;0;0;0;1
palindromic_recursion;1;0;0;0
conceptual_comprehension;0;1;0;0
infoxication_syndrome;0;0;1;0
foreign_words_abuse;0;0;0;1"
), stringsAsFactors = TRUE) # make category a factor (the default changed in R 4.0.0)
Fit the tree (using the partykit implementation of ctree()):
library("partykit")
concepts.ctree <- ctree(category ~ ., data = data)
For visualization create a viewport with sufficiently large margins on the x-axis first. Then, add the tree to the existing viewport page and set the rotation/justification arguments for the barplot.
pushViewport(plotViewport(margins = c(6, 0, 0, 0)))
plot(concepts.ctree, tp_args = list(rot = 45, just = c("right", "top")),
newpage = FALSE)
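To see what rot and just do on their own, independent of partykit, here is a small grid-only sketch. It is an illustration under my own assumptions: the label positions are arbitrary and the labels are simply the category names from the question.

```r
library(grid)

grid.newpage()
# Leave 6 lines of margin at the bottom for the rotated labels
pushViewport(plotViewport(margins = c(6, 1, 1, 1)))
grid.rect()

labels <- c("palindromic_recursion", "conceptual_comprehension",
            "infoxication_syndrome", "foreign_words_abuse")
x_pos <- unit(seq_along(labels) / (length(labels) + 1), "npc")

# rot rotates each label; just = c("right", "top") anchors the label's
# top-right corner at (x, y), so the text hangs down and to the left
lab <- textGrob(labels, x = x_pos, y = unit(-0.5, "lines"),
                rot = 45, just = c("right", "top"))
grid.draw(lab)
popViewport()
```

The bottom margin has to be large enough for the longest rotated label, which is why the tree example above pushes a viewport with a 6-line margin before plotting.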

In R, how can I make the branches of a classification tree not overlap in a plot?

I have a tree with a lot of branches. Here is my code to plot the tree. The problem is that the labels overlap each other, specially towards the bottom of the tree. Is there any way to plot the tree so that the labels don't overlap?
par(mfrow=c(1,1))
plot(prunedTree, type=c("uniform"))
text(prunedTree)
Note: I used type=c("uniform") because it helped the readability of the lower branches. Also, prunedTree is of class "tree" from the tree package.
Here's a sample of what is being produced currently.
EDIT: Code to fully reproduce the issue.
load(url("https://spark-public.s3.amazonaws.com/dataanalysis/samsungData.rda"))
samsungData$subject <- factor(samsungData$subject)
samsungData$activity <- factor(samsungData$activity)
samsungData <- samsungData[, !c(duplicated(names(samsungData)))]
names(samsungData) <- gsub("[.]", "", names(samsungData))
samsungData <- data.frame(samsungData)
trainDF <- samsungData[samsungData$subject %in% c(1,3,5,6),]
library(tree)
tree1 <- tree(activity ~ ., data=trainDF)
plot(tree1)
text(tree1)
You have several general options:
Use a wider graphics device (e.g. png(...,width = 1200,height = ...))
Shrink the text using cex = 0.5 (or smaller)
Use more concise column (i.e. variable) names
Some combination of the previous three.
I thought I could get text.tree to use fewer significant digits in labeling the splits, but I can't seem to do that. rpart appears to use only 4 digits by default, so that would save you some space as well.
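For the "more concise names" option, base R's abbreviate() can shorten column names before fitting the tree. A minimal sketch; the long names below are invented to resemble the accelerometer variables, not taken from the actual data.

```r
# Hypothetical long column names, similar in style to the samsungData ones
long_names <- c("tBodyAccelerationMeanX",
                "tBodyAccelerationMeanY",
                "tGravityAccelerationStdZ")

# Shorten to at least 8 characters; abbreviate() keeps distinct names distinct
short_names <- abbreviate(long_names, minlength = 8)

# On a real data frame this would be: names(df) <- abbreviate(names(df), minlength = 8)
```

Shorter split labels combined with a smaller cex usually help more than either change alone.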
In addition to joran's suggestions listed above, you can play with these parameters:
srt to rotate your text.
col to give the text different colors.
For example :
plot(tree1)
text(tree1,col=rainbow(5)[1:25],srt=85,cex=0.8)
