How to put 2 boxplots in one graph in R without additional libraries? - r

I have this kind of dataset
Defect.found Treatment Program
1 Testing Counter
1 Testing Correlation
0 Inspection Counter
3 Testing Correlation
2 Inspection Counter
I would like to create two boxplots, one boxplot of detected defects per program and one boxplot of detected defects per technique, but in one graph.
Meaning having:
boxplot(exp$Defect.found ~ exp$Treatment)
boxplot(exp$Defect.found ~ exp$Program)
In a joined graph.
Searching on Stack Overflow, I was able to create it, but only with the lattice library, typing:
bwplot(exp$Treatment + exp$Program ~ exp$Defects.detected)
but I would like to know if it's possible to create the graph without additional libraries like ggplot and lattice.

Prepare the plot window to receive two plots in one row and two columns (default is obviously one row and one column):
par(mfrow = c(1, 2))
My suggestion is to avoid using the word exp, because it is already used for the exponential function. Use for instance mydata.
Defects found against treatment (frame = FALSE suppresses the external box; plot() draws a boxplot here only if Treatment is a factor, so convert it with factor() first if it was read in as character):
with(mydata, plot(Defect.found ~ Treatment, frame = FALSE))
Defects found against program (ylab = NA suppresses the y label because it is already shown in the previous plot):
with(mydata, plot(Defect.found ~ Program, frame = FALSE, ylab = NA))
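For reference, here is a self-contained sketch of the same idea, rebuilt from the five sample rows in the question. The explicit factor() conversion is an assumption about how the data was read in, and the last call shows one possible way to get all four boxes into a single panel by handing boxplot() a list of groups:

```r
# Sample rows from the question; 'mydata' is the name suggested above
mydata <- data.frame(
  Defect.found = c(1, 1, 0, 3, 2),
  Treatment = factor(c("Testing", "Testing", "Inspection", "Testing", "Inspection")),
  Program = factor(c("Counter", "Correlation", "Counter", "Correlation", "Counter"))
)

# Two panels side by side
old_par <- par(mfrow = c(1, 2))
boxplot(Defect.found ~ Treatment, data = mydata, frame.plot = FALSE)
boxplot(Defect.found ~ Program, data = mydata, frame.plot = FALSE, ylab = "")
par(old_par)  # restore the previous layout

# Alternative: a single panel with one box per treatment and per program level
groups <- with(mydata, c(split(Defect.found, Treatment),
                         split(Defect.found, Program)))
b <- boxplot(groups, ylab = "Defect.found")
```

The single-panel version puts both groupings on one shared axis, so the box labels need to make clear which grouping each box belongs to.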

Related

Error in axis(side = side, at = at, labels = labels, ...) : invalid value specified for graphical parameter "pch"

I have applied the DBSCAN algorithm to the built-in iris dataset in R, but I get an error when I try to visualise the output using plot().
Following is my code.
library(fpc)
library(dbscan)
data("iris")
head(iris,2)
data1 <- iris[,1:4]
head(data1,2)
set.seed(220)
db <- dbscan(data1,eps = 0.45,minPts = 5)
table(db$cluster,iris$Species)
plot(db,data1,main = 'DBSCAN')
Error: Error in axis(side = side, at = at, labels = labels, ...) :
invalid value specified for graphical parameter "pch"
How to rectify this error?
I have a suggestion below, but first I see two issues:
You're loading two packages, fpc and dbscan, both of which have different functions named dbscan(). This could create tricky bugs later (e.g. if you change the order in which you load the packages, different functions will be run).
It's not clear what you're trying to plot, either what the x- or y-axes should be or the type of plot. The function plot() generally takes a vector of values for the x-axis and another for the y-axis (although not always, consult ?plot), but here you're passing it a data.frame and a dbscan object, and it doesn't know how to handle it.
Here's one way of approaching it, using ggplot() to make a scatterplot, and dplyr for some convenience functions:
# load our packages
# note: only loading dbscan, not loading fpc since we're not using it
library(dbscan)
library(ggplot2)
library(dplyr)
# run dbscan::dbscan() on the first four columns of iris
db <- dbscan::dbscan(iris[,1:4],eps = 0.45,minPts = 5)
# create a new data frame by binding the derived clusters to the original data
# this keeps our input and output in the same dataframe for ease of reference
data2 <- bind_cols(iris, cluster = factor(db$cluster))
# make a table to confirm it gives the same results as the original code
table(data2$cluster, data2$Species)
# using ggplot, make a point plot with "jitter" so each point is visible
# x-axis is species, y-axis is cluster, also coloured according to cluster
ggplot(data2) +
  geom_point(mapping = aes(x = Species, y = cluster, colour = cluster),
             position = "jitter") +
  labs(title = "DBSCAN")
If you're looking for something else, please be more specific about what the final plot should look like.
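If a base-graphics view is enough, one option is a scatterplot matrix coloured by cluster label. The sketch below is only an illustration: it uses stats::kmeans() as a stand-in source of labels so that it runs without extra packages, which is not the DBSCAN fit from the question. With the real fit you would pass db$cluster instead, where 0 marks noise points.

```r
# Stand-in cluster labels so the sketch runs with base R only.
# With a real DBSCAN fit, use db$cluster here instead (0 = noise).
set.seed(220)
cluster <- kmeans(iris[, 1:4], centers = 3)$cluster

# Shift by one so that a label of 0 would still map to a visible colour
cols <- cluster + 1L

# Scatterplot matrix of the four measurements, coloured by cluster
pairs(iris[, 1:4], col = cols, pch = 19, main = "Clusters (stand-in labels)")
```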

grouping without additional packages

I'm using R to plot my data, but am unable to install packages for the moment as my workplace has put up a lot of firewalls (currently trying to get IT to get them down).
In the meantime, I was wondering if by using the plot() function I was able to plot my data in groups.
I have three variables in my data: IDName, Value, and Setpoints.
I wanted to aggregate my values for each setpoint, so I used the aggregate() function. However, this aggregates all data for each setpoint, whereas I only want it to aggregate within each IDName. All forms of grouping seem to require a package, so I was wondering if anyone knew any workarounds.
I've supplied the code below (note that the R script is within PowerBI, but for the purposes of my question only R expertise is needed). It would also be great if you know how to colour these points accordingly to each IDName.
# dataset <- data.frame(IDName, Value, Setpoints)
# dataset <- unique(dataset)
# Paste or type your script code here:
dat <- aggregate(Value ~ Setpoints, dataset, mean)
x <- dat$Value
y <- dat$Setpoints
z <- dataset$IDName
plot(x,y, main ="Turbidity Frequency Distribution",xlab="% Time < Turbidity level", ylab="Turbidity (NTU)")
lines(spline(x,y))
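One possible base-R approach, sketched here under assumptions: aggregate() accepts more than one grouping variable on the right-hand side of the formula, and the points can then be coloured by the integer codes of a factor. The data frame below is a made-up stand-in for the Power BI dataset, not the real data.

```r
# Hypothetical stand-in for the Power BI 'dataset' (two IDs, three setpoints)
set.seed(1)
dataset <- data.frame(
  IDName    = rep(c("A", "B"), each = 6),
  Setpoints = rep(c(1, 2, 3), times = 4),
  Value     = runif(12, 0, 100)
)

# Mean Value for every Setpoints x IDName combination
dat <- aggregate(Value ~ Setpoints + IDName, data = dataset, FUN = mean)

# Colour each point by its IDName (factor codes 1, 2, ... index the palette)
ids <- factor(dat$IDName)
plot(dat$Value, dat$Setpoints, col = as.integer(ids), pch = 19,
     main = "Turbidity Frequency Distribution",
     xlab = "% Time < Turbidity level", ylab = "Turbidity (NTU)")

# One line per IDName, drawn through its own points only
for (k in seq_along(levels(ids))) {
  sub <- dat[ids == levels(ids)[k], ]
  sub <- sub[order(sub$Value), ]
  lines(sub$Value, sub$Setpoints, col = k)
}
legend("topleft", legend = levels(ids),
       col = seq_along(levels(ids)), pch = 19)
```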

Using multiple datasets for one graph

I have 2 csv data files. Each file has a "date_time" column and a "temp_c" column. I want to make the x-axis have the "date_time" from both files and then use 2 y-axes to display each "temp_c" with separate lines. I would like to use plot instead of ggplot2 if possible. I haven't been able to find any code help that works with my data and I'm not sure where to really begin. I know how to do 2 separate plots for these 2 datasets, just not combine them into one graph.
plot(grewl$temp_c ~ grewl$date_time)
and
plot(kbll$temp_c ~ kbll$date_time)
work separately but not together.
As others indicated, it is easy to add new data to a graph using points() or lines(). One thing to be careful about is how you format the axes as they will not be automatically adjusted to fit any new data you input using points() and the like.
I've included a small example below that you can copy, paste, run, and examine. Pay attention to why the first plot fails to produce what you want (axes are bad). Also note how I set this example up generally - by making fake data that showcase the same "problem" you are having. Doing this is often a better strategy than simply pasting in your data since it forces you to think about the core component of the problem you are facing.
#for same result each time
set.seed(1234)
#make data
set1 <- data.frame("date1" = seq(1, 10),
                   "temp1" = rnorm(10))
set2 <- data.frame("date2" = seq(8, 17),
                   "temp2" = rnorm(10, 1, 1))
#first attempt fails
#plot one
plot(set1$date1, set1$temp1, type = "b")
#add points - oops only three showed up bc the axes are all wrong
lines(set2$date2, set2$temp2, type = "b")
#second attempt
#adjust axes to fit everything (set to min and max of either dataset)
plot(set1$date1, set1$temp1,
     xlim = c(min(set1$date1, set2$date2), max(set1$date1, set2$date2)),
     ylim = c(min(set1$temp1, set2$temp2), max(set1$temp1, set2$temp2)),
     type = "b")
#now add the other points
lines(set2$date2, set2$temp2, type = "b")
# we can even add regression lines
abline(reg = lm(set1$temp1 ~ set1$date1))
abline(reg = lm(set2$temp2 ~ set2$date2))
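The question also asked for two y-axes. A common base-graphics pattern for that is to overlay a second plot with par(new = TRUE) and draw its axis on the right with axis(4). The sketch below assumes date_time has already been parsed to POSIXct; the two data frames are invented stand-ins for the CSV files, with only the names grewl and kbll taken from the question.

```r
# Hypothetical stand-ins for the two CSV files (hourly readings)
t0 <- as.POSIXct("2020-01-01 00:00:00", tz = "UTC")
grewl <- data.frame(date_time = t0 + 3600 * (0:9),  temp_c = seq(5, 14))
kbll  <- data.frame(date_time = t0 + 3600 * (5:14), temp_c = seq(20, 11))

# Shared x range covering both datasets
xlim <- range(c(grewl$date_time, kbll$date_time))

# Widen the right margin to make room for the second axis
old_par <- par(mar = c(5, 4, 4, 4) + 0.1)

# First series with the usual left-hand axis
plot(grewl$date_time, grewl$temp_c, type = "l", xlim = xlim,
     xlab = "date_time", ylab = "grewl temp_c")

# Second series drawn on top, with its own y scale and no axes
par(new = TRUE)
plot(kbll$date_time, kbll$temp_c, type = "l", col = "red", xlim = xlim,
     axes = FALSE, xlab = "", ylab = "")
axis(side = 4, col.axis = "red")                    # right-hand axis
mtext("kbll temp_c", side = 4, line = 3, col = "red")

par(old_par)  # restore the previous margins
```

Passing the same xlim to both calls keeps the two series aligned in time; only the y scales differ.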

Advise a Chemist: Automate/Streamline his Voltammetry Data Graphing Code

I am a chemist dealing with a significant amount of voltammetry data recently. Let me be very clear and give some research information. I run scans from a starting voltage to an ending voltage on solid state conductive films. These scans are saved as .txt files (name scheme: run#.txt) in a single folder. I am looking at how conductance changes as temperature changes. The LINEST line plotting current v. voltage at a given temperature gives me a line with slope = conductance. Once I have the conductances (slopes) for each scan, I plot conductance v. temperature to see the temperature dependent conductance characteristics. I had been doing this in Excel, but have found quicker ways to get the job done using R. I am brand new to R (Rstudio) and recognize that my coding is not the best. Without doubt, this process can be streamlined and sped up which would help immensely. This is how I am performing the process currently:
# Set working directory with folder containing all .txt files for inspection
# Add all .txt files to the global environment
allruns<-list.files(pattern=".txt")
for(i in 1:length(allruns))assign(allruns[i],read.table(allruns[i]))
Since the voltage column (a 1x1000 matrix) is the same for all runs and is in column V1 of each .txt file, I assign x to be the voltage column from the first file:
x<-run1.txt$V1
All currents (these change as voltage changes) are found in the V2 column of all the .txt files, so I assign y# to each. These are entered one at a time:
y1<-run1.txt$V2
y2<-run2.txt$V2
y3<-run3.txt$V2
# ...
yn<-runn.txt$V2
So that I can get the eqn for each LINEST (one LINEST for each scan and plotted with abline later). Again entered one at a time:
run1<-lm(y1~x)
run2<-lm(y2~x)
run3<-lm(y3~x)
# ...
runn<-lm(yn~x)
To obtain a single graph with all LINEST (one for each scan ) on the same plot, without the data points showing up, I have been using this pattern of coding to first get all data points on a single plot in separate series:
plot(x,y1,col="transparent",main="LSV Solid Film", xlab = "potential(V)",ylab="current(A)", xlim=rev(range(x)),ylim=range(c(y3,yn)))
par(new=TRUE)
plot(x,y2,col="transparent",main="LSV Solid Film", xlab = "potential(V)",ylab="current(A)", xlim=rev(range(x)),ylim=range(c(y3,yn)))
par(new=TRUE)
plot(x,y3,col="transparent",main="LSV Solid Film", xlab = "potential(V)",ylab="current(A)", xlim=rev(range(x)),ylim=range(c(y1,yn)))
# ...
par(new=TRUE)
plot(x,yn,col="transparent",main="LSV Solid Film", xlab = "potential(V)",ylab="current(A)", xlim=rev(range(x)),ylim=range(c(y1,yn)))
#To obtain all LINEST lines (one for each scan, on the single graph):
abline(run1,col="", lwd=1)
abline(run2,col="",lwd=1)
abline(run3,col="",lwd=1)
# ...
abline(runn,col="",lwd=1)
# Then to get each LINEST equation:
summary(run1)
summary(run2)
summary(run3)
# ...
summary(runn)
Each time I use summary(), I copy the slope and paste it into an Excel sheet, along with the corresponding scan temperature, which I have recorded separately. I then graph the conductance v. temp points for the film as an X-Y scatter with smooth lines to give the temperature-dependent conductance curve. This gives me a single plot of LINEST lines in R and the conductance v. temp plot in Excel.
This technique is actually MUCH quicker than doing it all in Excel, but it could be done much more quickly and efficiently!!! Also, if I need to change something, the entire process has to be re-executed with whatever change is necessary. This process takes me maybe 5 hours in Excel and 1.5 hours in R (maybe I am too slow). Nonetheless, any tips to help automate/streamline this further are greatly appreciated.
There are plenty of questions about operating on data in lists; storing a list of matrix or a list of data.frame is fast, and code that operates cleanly on one can be applied to the remaining n-1 very easily.
(Note: the way I'm showing it here is one technique: maintaining everything in well-compartmentalized lists. Others will suggest -- very justifiably -- that combining things into a single data.frame and adding a group variable (to identify from which file/experiment the data originated) will help with more advanced multi-experiment regression or combined plotting, such as with ggplot2. I'm not going to go into this latter technique here, not yet.)
The for(...) assign(..., read.csv(...)) pattern has long been discouraged; you have the important part done, so this is relatively easy:
allruns <- sapply(list.files(pattern = "\\.txt$"), read.table, simplify = FALSE)
(The use of sapply(..., simplify=FALSE) is similar to lapply(...), but it has a nice side-effect of naming the individual list-ified elements with, in this case, each filename. It may not be critical here but is quite handy elsewhere.)
Extracting your invariant and variable data is simple enough:
allLMs <- lapply(allruns, function(mdl) lm(V2 ~ V1, data = mdl))
I'm using each table's V1 here instead of a once-extracted x. Though you might wonder why, I argue for keeping it like this for two reasons: (1) JUST IN CASE the V1 variable is ever even one-row-different, this will save you; (2) it is very easy to construct the model like this.
At this point, each object within allLMs is an lm object, meaning we might do:
summary(allLMs[[1]])
Plotting: I think I understand why you are using par(new=TRUE), and I have to laugh ... I had been deep in R for a while before I started using that technique. What I think you need is actually much simpler:
xlim <- rev(range(allruns[[1]]$V1))
ylim <- range(unlist(lapply(allruns, `[[`, "V2")))
# this next plot just sets the box and axes, no points
plot(NA, type = "n", xlim = xlim, ylim = ylim)
# no need to plot points with "transparent" ...
ign <- sapply(allLMs, abline, col = "") # and other abline options ...
Copying all models into Excel, again, using lists:
out <- do.call(rbind, lapply(allLMs, function(m) summary(m)$coefficients[,1]))
This will now be a single matrix with one row per model and the coefficients (intercept and slope) in two columns. (Feel free to use similar techniques to extract the other model summary attributes, including std err, t.value, or Pr(>|t|) (in the $coefficients); or $r.squared, $adj.r.squared, etc.)
write.table(out, file="clipboard", sep="\t")
and paste into Excel. (Or, better yet, save it to a CSV file and import that, since you might want to keep it around.)
One of the tricks to using lists for this is to persevere: keep things in lists as long as you can, so that you don't have to deal with models individually. One mantra is that if you do it once, you shouldn't have to type it again, just loop/apply/map/whatever. Don't extract too much from the lists before you have to.
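Putting the pieces together, a minimal end-to-end sketch in base R might look like the following. Everything about the data is an assumption made for illustration: the three files are generated in a temporary folder, the temperatures in temps are invented, and coef() is used as a shortcut for pulling the slope out of each model.

```r
# Hypothetical demo data: three noise-free "run#.txt" files in a temp folder
dir <- file.path(tempdir(), "runs_demo")
dir.create(dir, showWarnings = FALSE)
true_slopes <- c(0.5, 1.0, 1.5)
for (i in seq_along(true_slopes)) {
  V <- seq(-1, 1, length.out = 50)
  write.table(data.frame(V, true_slopes[i] * V + 0.1),
              file = file.path(dir, paste0("run", i, ".txt")),
              row.names = FALSE, col.names = FALSE)
}

# Read every file into one named list (columns come back as V1, V2)
files <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)
allruns <- sapply(files, read.table, simplify = FALSE)
names(allruns) <- basename(files)

# One linear model per run; the slope of each is the conductance
allLMs <- lapply(allruns, function(d) lm(V2 ~ V1, data = d))
slopes <- sapply(allLMs, function(m) coef(m)[["V1"]])

# All fitted lines on one set of axes, without plotting the points
xlim <- rev(range(allruns[[1]]$V1))
ylim <- range(unlist(lapply(allruns, `[[`, "V2")))
plot(NA, type = "n", xlim = xlim, ylim = ylim,
     main = "LSV Solid Film", xlab = "potential(V)", ylab = "current(A)")
for (m in allLMs) abline(m)

# Conductance against temperature (temperatures are made up here)
temps <- c(300, 310, 320)
plot(temps, slopes, type = "b", xlab = "temperature", ylab = "conductance")
```

Because the slopes stay in a vector alongside the temperatures, the final plot can be made in R directly, without the copy-and-paste step through Excel.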
Note: r2evans' answer provides good general advice and doesn't require heavy package dependencies. But it probably doesn't hurt to see alternative strategies.
The tidyverse can be quite handy for this sort of thing. Here's a dummy example for illustration:
library(tidyverse)
# creating dummy data files
dummy <- function(T) {
  V <- seq(-5, 5, length=20)
  I <- jitter(T*V + T, factor = 1)
  write.table(data.frame(V=V, I = I),
              file = paste0(T,".txt"),
              row.names = FALSE)
}
purrr::walk(300:320, dummy)
# reading
lf <- list.files(pattern = "\\.txt")
read_one <- function(f, ...) {cbind(T = as.numeric(gsub("\\.txt", "", f)), read.table(f, ...))}
m <- purrr::map_df(lf, read_one, header = TRUE, .id="id")
head(m)
ggplot(m, aes(V, I, group = T)) +
  facet_wrap( ~ T) +
  geom_point() +
  geom_smooth(se = FALSE)
models <- m %>%
  split(.$T) %>%
  map(~lm(I ~ V, data = .))
coefs <- models %>% map_df(broom::tidy, .id = "T")
ggplot(coefs, aes(as.numeric(T), estimate)) +
  geom_line() +
  facet_wrap(~term, scales = "free")

Multiple histograms in Julia using Plots.jl

I am working with a large number of observations and to really get to know it I want to do histograms using Plots.jl
My question is how I can do multiple histograms in one plot as this would be really handy. I have tried multiple things already, but I am a bit confused with the different plotting sources in julia (plots.jl, pyplot, gadfly,...).
I don't know if it would help for me to post some of my code, as this is a more general question. But I am happy to post it, if needed.
There is an example that does just this:
using Plots
pyplot()
n = 100
x1, x2 = rand(n), 3rand(n)
# see issue #186... this is the standard histogram call
# our goal is to use the same edges for both series
histogram(Any[x1, x2], line=(3,0.2,:green), fillcolor=[:red :black], fillalpha=0.2)
I looked for "histograms" in the Plots.jl repo, found this related issue and followed the links to the example.
With Plots, there are two possibilities to show multiple series in one plot:
First, you can use a matrix, where each column constitutes a separate series:
a, b, c = randn(100), randn(100), randn(100)
histogram([a b c])
Here, hcat is used to concatenate the vectors (note the spaces instead of commas).
This is equivalent to
histogram(randn(100,3))
You can apply options to the individual series using a row matrix:
histogram([a b c], label = ["a" "b" "c"])
(Again, note the spaces instead of commas)
Second, you can use plot! and its variants to update a previous plot:
histogram(a) # creates a new plot
histogram!(b) # updates the previous plot
histogram!(c) # updates the previous plot
Alternatively, you can specify which plot to update:
p = histogram(a) # creates a new plot p
histogram(b) # creates an independent new plot
histogram!(p, c) # updates plot p
This is useful if you have several subplots.
Edit:
Following Felipe Lema's links, you can implement a recipe for histograms that share the edges:
using StatsBase
using PlotRecipes
function calcbins(a, bins::Integer)
    lo, hi = extrema(a)
    StatsBase.histrange(lo, hi, bins) # nice edges
end
calcbins(a, bins::AbstractVector) = bins
@userplot GroupHist
@recipe function f(h::GroupHist; bins = 30)
    args = h.args
    length(args) == 1 || error("GroupHist should be given one argument")
    bins = calcbins(args[1], bins)
    seriestype := :bar
    bins, mapslices(col -> fit(Histogram, col, bins).weights, args[1], 1)
end
grouphist(randn(100,3))
Edit 2:
Because it is faster, I changed the recipe to use StatsBase.fit for creating the histogram.

Resources