How to produce leverage stats? - R

I know how to produce the plots using leveragePlot(), but I cannot find a way to produce a leverage statistic for each observation, like in the MegaStat output.

I think you're looking for the hat values.
Use hatvalues(fit). The rule of thumb is to examine any observations whose hat value is 2-3 times greater than the average hat value. I don't know of a specific function or package off the top of my head that provides this info in a nice data frame, but doing it yourself is fairly straightforward. Here's an example:
fit <- lm(hp ~ cyl + mpg, data = mtcars)  # a fake model
hatvalues(fit)
hv <- as.data.frame(hatvalues(fit))
mn <- mean(hatvalues(fit))
hv$warn <- ifelse(hv[, 'hatvalues(fit)'] > 3 * mn, 'x3',
    ifelse(hv[, 'hatvalues(fit)'] > 2 * mn, 'x2', '-'))
hv
For larger data sets you could use subset and/or order to look at just certain ranges of hat values:
subset(hv, warn=="x3")
subset(hv, warn%in%c("x2", "x3"))
hv[order(hv['hatvalues(fit)']), ]
I actually came across a nice plot function that does this in the book R in Action, but as it is a copyrighted book I will not reproduce Kabacoff's intellectual property. That plot would work even better for mid-sized data sets.
Here is a decent hat plot, though, that you may also want to investigate:
plot(hatvalues(fit), type = "h")
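If you want the rule-of-thumb cutoffs visible on that plot, here is a minimal sketch (reusing the fit from above; the 2x/3x multipliers are just the rule of thumb mentioned earlier):
hval <- hatvalues(fit)
plot(hval, type = "h", ylab = "hat value")
abline(h = c(2, 3) * mean(hval), lty = 2)  # cutoffs at 2x and 3x the mean
big <- hval > 2 * mean(hval)               # label points past the 2x cutoff
text(which(big), hval[big], labels = names(hval)[big], pos = 3, cex = 0.7)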

Related

Is there a way to add species to an ISOMAP plot in R?

I am using the isomap function from the vegan package in R to analyse community data of epiphytic mosses and lichens. I started analysing the data using NMDS but, due to the structure of the data, ran into problems, which is why I switched to ISOMAP; it works perfectly well and returns very nice results. So far so good... However, the output of the function does not support plotting of species within the ISOMAP plot, as species scores are not available. Anyway, I would really like to add species information to enhance the interpretability of the output.
Does any of you have a solution or hint for this problem? Is there a way to add species post hoc to the plot, as can be done with environmental data?
I would greatly appreciate any help on this topic!
Thank you and best regards,
Inga
No, there is no function to add species scores to isomap. It would look like this:
`sppscores<-.isomap` <- function(object, value)
{
    value <- scale(value, center = TRUE, scale = FALSE)
    v <- crossprod(value, object$points)
    attr(v, "data") <- deparse(substitute(value))
    object$species <- v
    object
}
Or alternatively:
`sppscores<-.isomap` <- function(object, value)
{
    wa <- vegan::wascores(object$points, value, expand = TRUE)
    attr(wa, "data") <- deparse(substitute(value))
    object$species <- wa
    object
}
If ord is your isomap result and comm are your community data, you can use these as:
sppscores(ord) <- comm # either alternative
I have no idea (yet) which of these alternatives is more correct. The first adds species scores as vectors of their linear increase; the second as their weighted averages in ordination space, expanded so that some species are allowed to be more extreme than the site units where they occur.
These will add a new element, species, to the result object ord. Using these scores elsewhere in vegan would need more coding, but you can extract them with vegan::scores. Their scaling is based on the original scale of the community data, so they may be badly scaled with respect to the site-unit points; fixing that properly would require more work. In the meantime you can plot them separately, or multiply them by a constant to get a scaling similar to the site-unit scores.
sp <- scores(ord, display="species", choices=1:2)
plot(sp, type = "n", asp = 1) # does not allow plotting text
text(sp, labels = rownames(sp)) # so we must add text
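If the species scores come out badly scaled relative to the site units (as cautioned above), here is a crude rescaling sketch (my own suggestion, not a vegan facility; it assumes ord$points holds the site coordinates, as in the functions above):
sit <- ord$points[, 1:2]                 # site unit scores
sp <- scores(ord, display = "species", choices = 1:2)
sp <- sp * max(abs(sit)) / max(abs(sp))  # match the spread of the site scores
plot(sit, asp = 1)
text(sp, labels = rownames(sp), col = "red")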

Grouping without additional packages

I'm using R to plot my data but am unable to install packages for the moment, as my workplace has put up a lot of firewalls (I'm currently trying to get IT to take them down).
In the meantime, I was wondering whether I can plot my data in groups using just the plot() function.
I have three variables in my data: IDName, Value, and Setpoints.
I want to aggregate my values for each setpoint, so I used the aggregate() function, but this aggregates all data for each setpoint, whereas I only want it to aggregate within each IDName. All forms of grouping seem to require a package, so I was wondering if anyone knew any workarounds.
I've supplied the code below (note that the R script runs within PowerBI, but for the purposes of my question only R expertise is needed). It would also be great if you know how to colour these points according to each IDName.
# dataset <- data.frame(IDName, Value, Setpoints)
# dataset <- unique(dataset)
# Paste or type your script code here:
dat <- aggregate(Value ~ Setpoints, dataset, mean)
x <- dat$Value
y <- dat$Setpoints
z <- dataset$IDName
plot(x, y, main = "Turbidity Frequency Distribution",
     xlab = "% Time < Turbidity level", ylab = "Turbidity (NTU)")
lines(spline(x, y))
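For what it's worth, a minimal base-R sketch of one way to do this (aggregate by both IDName and Setpoints, then colour by IDName; dataset, IDName, Value, and Setpoints are the names from the code above):
dat <- aggregate(Value ~ Setpoints + IDName, dataset, mean)
grp <- factor(dat$IDName)
plot(dat$Value, dat$Setpoints, col = as.integer(grp),
     main = "Turbidity Frequency Distribution",
     xlab = "% Time < Turbidity level", ylab = "Turbidity (NTU)")
legend("topright", legend = levels(grp), col = seq_along(levels(grp)), pch = 1)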

Advise a Chemist: Automate/Streamline his Voltammetry Data Graphing Code

I am a chemist dealing with a significant amount of voltammetry data recently. Let me be very clear and give some research information. I run scans from a starting voltage to an ending voltage on solid-state conductive films. These scans are saved as .txt files (naming scheme: run#.txt) in a single folder. I am looking at how conductance changes as temperature changes. The LINEST line plotting current v. voltage at a given temperature gives me a line with slope = conductance. Once I have the conductances (slopes) for each scan, I plot conductance v. temperature to see the temperature-dependent conductance characteristics. I had been doing this in Excel but have found quicker ways to get the job done using R. I am brand new to R (RStudio) and recognize that my coding is not the best. Without doubt, this process can be streamlined and sped up, which would help immensely. This is how I am currently performing the process:
# Set working directory with folder containing all .txt files for inspection
# Add all .txt files to the global environment
allruns <- list.files(pattern = ".txt")
for (i in 1:length(allruns)) assign(allruns[i], read.table(allruns[i]))
Since the voltage column (a vector of 1000 values) is the same for all runs and is in column V1 of each .txt file, I assign x to be the voltage column from the first file:
x<-run1.txt$V1
All currents (these change as voltage changes) are found in the V2 column of each .txt file, so I assign a y# to each. These are entered one at a time:
y1<-run1.txt$V2
y2<-run2.txt$V2
y3<-run3.txt$V2
# ...
yn<-runn.txt$V2
This gives me the equation for each LINEST fit (one per scan, plotted with abline later). Again, these are entered one at a time:
run1<-lm(y1~x)
run2<-lm(y2~x)
run3<-lm(y3~x)
# ...
runn<-lm(yn~x)
To obtain a single graph with all the LINEST lines (one for each scan) on the same plot, without the data points showing up, I have been using this pattern of code to first get all the data points onto a single plot as separate series:
plot(x,y1,col="transparent",main="LSV Solid Film", xlab = "potential(V)",ylab="current(A)", xlim=rev(range(x)),ylim=range(c(y3,yn)))
par(new=TRUE)
plot(x,y2,col="transparent",main="LSV Solid Film", xlab = "potential(V)",ylab="current(A)", xlim=rev(range(x)),ylim=range(c(y3,yn)))
par(new=TRUE)
plot(x,y3,col="transparent",main="LSV Solid Film", xlab = "potential(V)",ylab="current(A)", xlim=rev(range(x)),ylim=range(c(y1,yn)))
# ...
par(new=TRUE)
plot(x,yn,col="transparent",main="LSV Solid Film", xlab = "potential(V)",ylab="current(A)", xlim=rev(range(x)),ylim=range(c(y1,yn)))
#To obtain all LINEST lines (one for each scan, on the single graph):
abline(run1, col = "", lwd = 1)
abline(run2, col = "", lwd = 1)
abline(run3, col = "", lwd = 1)
# ...
abline(runn, col = "", lwd = 1)
# Then to get each LINEST equation:
summary(run1)
summary(run2)
summary(run3)
# ...
summary(runn)
Each time I use summary(), I copy the slope and paste it into an Excel sheet, along with the corresponding scan temperature, which I have recorded separately. I then graph the conductance v. temperature points for the film as an X-Y scatter with smooth lines to give the temperature-dependent conductance curve. This gives me a single plot of LINEST lines in R and the conductance v. temperature curve in Excel.
This technique is actually MUCH quicker than doing it all in Excel, but it could surely be done even more quickly and efficiently! Also, if I need to change something, this entire process needs to be re-executed with whatever change is necessary. This process takes me maybe 5 hours in Excel and 1.5 hours in R (maybe I am too slow). Nonetheless, any tips to help automate/streamline this further are greatly appreciated.
There are plenty of questions about operating on data in lists; storing a list of matrices or a list of data.frames is fast, and code that operates cleanly on one can be applied to the remaining n-1 very easily.
(Note: the way I'm showing it here is one technique: maintaining everything in well-compartmentalized lists. Others will suggest -- very justifiably -- that combining things into a single data.frame and adding a group variable (to identify from which file/experiment the data originated) will help with more advanced multi-experiment regression or combined plotting, such as with ggplot2. I'm not going to go into that latter technique here, not yet.)
The for(...) assign(..., read.table(...)) pattern has long been decried; you have the important part done, so fixing this is relatively easy:
allruns <- sapply(list.files(pattern = "\\.txt$"), read.table, simplify = FALSE)
(The use of sapply(..., simplify = FALSE) is similar to lapply(...), but it has a nice side effect of naming the individual list elements with, in this case, each filename. That may not be critical here but is quite handy elsewhere.)
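For instance, with hypothetical files run1.txt and run2.txt in the working directory, names(allruns) would be c("run1.txt", "run2.txt"), so a single run can be pulled out by name:
allruns[["run1.txt"]]  # one run's data.frame, accessed by filename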
Fitting one linear model per run is simple enough:
allLMs <- lapply(allruns, function(mdl) lm(V2 ~ V1, data = mdl))
I'm using each table's own V1 here instead of a once-extracted x. You might wonder why; I argue for keeping it like this for two reasons: (1) just in case the V1 column ever differs by even one row, this will save you; (2) it is very easy to construct the model this way.
At this point, each object within allLMs is an lm object, meaning we might do:
summary(allLMs[[1]])
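For instance, if all you ultimately need is the slope of each fit (the conductance), a one-liner keeps everything in the list world; V1 is the voltage column as above:
slopes <- sapply(allLMs, function(m) coef(m)[["V1"]])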
Plotting: I think I understand why you are using par(new=TRUE), and I have to laugh ... I had been deep in R for a while before I started using that technique. What I think you need is actually much simpler:
xlim <- rev(range(allruns[[1]]$V1))
ylim <- range(sapply(allruns, function(d) range(d$V2)))
# this next plot just sets up the box and axes; no points yet
plot(NA, type = "n", xlim = xlim, ylim = ylim)
# no need to plot points with "transparent" ...
ign <- sapply(allLMs, abline)  # add col=, lwd=, and other abline options here
Copying all model coefficients into Excel, again using lists:
out <- do.call(rbind, lapply(allLMs, function(m) summary(m)$coefficients[, 1]))
This will now be a single matrix with one row per model and the coefficients in two columns. (Feel free to use similar techniques to extract the other model summary attributes, including the std. error, t value, or Pr(>|t|) in $coefficients; or $r.squared, $adj.r.squared, etc.)
write.table(out, file = "clipboard", sep = "\t", col.names = NA)
and paste into Excel. (Or, better yet, save it to a CSV file and import that, since you might want to keep it around.)
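For the keep-it-around route, a one-liner (the file name is just an example) does it:
write.csv(out, file = "model_coefficients.csv")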
One of the tricks to using lists for this is to persevere: keep things in lists as long as you can, so that you don't have to deal with models individually. One mantra is that if you do it once, you shouldn't have to type it again; just loop/apply/map/whatever. Don't extract too much from the lists before you have to.
Note: r2evans' answer provides good general advice and doesn't require heavy package dependencies. But it probably doesn't hurt to see alternative strategies.
The tidyverse can be quite handy for this sort of thing; here's a dummy example for illustration:
library(tidyverse)
# creating dummy data files
dummy <- function(T) {
  V <- seq(-5, 5, length.out = 20)
  I <- jitter(T * V + T, factor = 1)
  write.table(data.frame(V = V, I = I),
              file = paste0(T, ".txt"),
              row.names = FALSE)
}
purrr::walk(300:320, dummy)
# reading
lf <- list.files(pattern = "\\.txt")
read_one <- function(f, ...) {
  cbind(T = as.numeric(gsub("\\.txt", "", f)), read.table(f, ...))
}
m <- purrr::map_df(lf, read_one, header = TRUE, .id = "id")
head(m)
ggplot(m, aes(V, I, group = T)) +
  facet_wrap(~ T) +
  geom_point() +
  geom_smooth(se = FALSE)
models <- m %>%
  split(.$T) %>%
  map(~ lm(I ~ V, data = .))
coefs <- models %>% map_df(broom::tidy, .id = "T")
ggplot(coefs, aes(as.numeric(T), estimate)) +
  geom_line() +
  facet_wrap(~ term, scales = "free")

R: Bad graphic of ordered boxplot according to median

Here is what I am trying to do: I have a data.frame (data) of 160 rows with 2 variables (fact, with 8 groups, and response), and I want to do a boxplot of response ~ fact, ordered in increasing order of the group medians.
Code :
data <- read.table("box.txt",header=T)
attach(data)
index <- order(tapply(response,fact,median))
ordered <- factor(rep(index,rep(20,8)))
boxplot(response~ordered,notch=T,names=as.character(index),xlab="treatments",ylab="response")
but on the graphic the boxes are badly plotted (not in the right order, and with "false" min, max, etc.).
I'm using RStudio with R 3.0.2 on Windows 7.
Any clue as to what that means?
One reproducible and seemingly correct answer would be:
set.seed(1)
data <- data.frame(response=10*rnorm(160), fact=factor(rep(1:8), labels=letters[1:8]))
data$fact <- reorder(data$fact, data$response, median)
boxplot(response~fact, data=data, notch=TRUE, xlab="treatments", ylab="response")
Names on the ticks of the x axis are correct, without further ado.
No idea why it looks 'bad', but the order is wrong because you used order instead of rank to find the index. For the other issues you probably have to make a reproducible example.
A reproducible example follows, with two boxplots to compare. In my case the plot (possibly) looks bad because of the 'devil's ears' (notches extending past the box hinges). Regarding the OP's question, I interpret his phrasing as 'bad' referring to the fact that using order() instead of rank() resulted in other mishaps as well (although I wouldn't know why).
data <- data.frame(response=rnorm(160), fact=factor(rep(1:8), labels=letters[1:8]))
boxplot(response~fact, data=data, notch=TRUE, xlab="treatments", ylab="response")
data$ordered <- rank(tapply(data$response, data$fact, median))
boxplot(response~ordered, data=data, notch=TRUE, xlab="treatments", ylab="response")
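To see why order() mislabels the groups while rank() does not, compare the two on a small vector:
x <- c(30, 10, 20)
order(x)  # 2 3 1: the positions that would sort x
rank(x)   # 3 1 2: each element's rank, in place
# relabelling groups by their medians needs the in-place ranks, not a permutation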

1-D conditional slice from a 2-D probability density function in R using np package

Consider the example included in the np package for R, on page 21 of the np package vignette.
npcdens returns a conditional density object and can plot the 2-D pdf and cdf, as shown. I wanted to know whether I can somehow extract the 1-D information (pdf/cdf) from the object if I specify one of the two parameters, perhaps as a vector? I am new to R and was not able to find out the format of the object.
Thanks for the help.
-Egon.
Here is the code as requested:
require(np)
data("Italy")
attach(Italy)
bw <- npcdensbw(formula=gdp~ordered(year), tol=.1, ftol=.1)
fhat <- npcdens(bws=bw)
summary(fhat)
npplot(bws=bw)
npplot(bws=bw, cdf=TRUE)
detach(Italy)
The fhat object contains all the needed info, plus a whole lot more. To see what's in there, do str(fhat) to inspect the structure.
I believe the values you are interested in are xeval, yeval, and condens (PDF density).
There are lots of ways to get at the values, but I tend to like data frames. I'd pop the three vectors into a single data frame:
denDf <- cbind(year = as.character(fhat$xeval[, 1]), fhat$yeval, fhat$condens)
## had to do a dance around the year variable because it's a factor
Then I'd select the values I want with subset():
subset(denDf, year == 1951 & gdp > 8 & gdp < 8.2)
Since gdp is a floating-point value, it's very hard to select it with the == operator.
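One alternative (my own habit, not from the answer above) is to pick the row whose gdp is nearest a target value rather than bracketing it with inequalities:
d1951 <- subset(denDf, year == 1951)
d1951[which.min(abs(d1951$gdp - 8.1)), ]  # row with gdp closest to 8.1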
The method suggested by JD Long will only extract the density at data points in the existing training set. If you want the density at other points (conditioning or conditional variables), you will need to use the predict() function. The following code extracts and plots the 1-D density conditioned on year == 1999, a value not contained in the original data set.
First, construct a data frame with the same components as the Italy data set, with gdp regularly spaced and with "1999" as an ordered factor.
yr1999<- rep("1999", 100)
gdpVals <-seq(1,35, length.out=100)
nD1999 <- data.frame(year = ordered(yr1999), gdp = gdpVals)
Next, use the predict function to extract the densities.
gdpDens1999 <- predict(fhat, newdata = nD1999)
The following code plots the density.
plot(gdpVals, gdpDens1999, type = 'l', col = 'red', xlab = 'gdp', ylab = 'p(gdp|yr = 1999)')
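As a quick sanity check (my own addition, not in the original answer), the extracted conditional density should integrate to roughly 1 over the gdp grid:
sum(gdpDens1999) * diff(gdpVals[1:2])  # crude Riemann sum; should be near 1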
