How to include variable values in histogram titles in R using by()

I want to produce histograms using by(). How can I access the values of the factors so that I can include them in the histogram headings? For example:
a <- runif(500, 0, 10)
b <- LETTERS[1:5]
c <- c("Condition1", "Condition2")
x <- data.frame("Variable1" = b, "Variable2"= c, "Value"=a)
head(x)
by(x$Value, x$Variable2, hist)
Or, using two variables:
by(x$Value, list(x$Variable2, x$Variable1), hist)
Is there a way of passing the variable value (e.g. Condition1) to the title of the histogram using the options within hist(), e.g. by putting function(x) hist(x, main=...) into by()?

Pass the split up dataframe rather than just the Values. Then you will have more to work with:
by(x, x$Variable2, function(x) hist(x$Value, main=unique(x$Variable2) ) )
This produces two plots, labeled Condition1 and Condition2.
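The same idea extends to the two-variable case from the question. The following is only a sketch: it rebuilds the example data with a fixed seed, and the " / " separator and the `titles` object are choices made here for illustration, not part of the original answer.

```r
set.seed(1)
x <- data.frame(Variable1 = LETTERS[1:5],
                Variable2 = c("Condition1", "Condition2"),
                Value = runif(500, 0, 10))

# Pass the whole data frame to by() so each subset still carries its factors,
# then build the title from the factor values found in that subset.
titles <- by(x, list(x$Variable2, x$Variable1), function(d) {
  ttl <- paste(unique(d$Variable2), unique(d$Variable1), sep = " / ")
  hist(d$Value, main = ttl, xlab = "Value")
  ttl  # return the title so it can be inspected afterwards
})
```

Each of the ten histograms is titled with its own combination, for example "Condition1 / A".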

This doesn't really answer your question, since you're specifying the use of by(), but I usually use split() and lapply() for these types of problems. My approach is usually along the lines of:
temp <- split(x$Value, list(x$Variable2, x$Variable1))
lapply(names(temp), function(x) hist(temp[[x]], main = x, xlab = "Value"))

Related

Plot multiple columns saved in data frame with no x

My problem is multifaceted.
I would like to plot multiple columns saved in a data frame. Those columns do not have an x variable but would essentially be 1 to 101 consistent for all. I have seen that I can transfer them into long format but most ggplot options require an X. I tried zoo which does what I want it to, but the x-label is all jumbled and I am not aware of how to fix it. (Example of data below, and plot)
df <- zoo(HIP_131_Y0_LC_walk1[1:9])
plot(df)
I have multiple data frames saved in a list so ultimately would like to run a function and apply to all. The zoo function solves step one but I am not able to apply to all the data frames in the list.
graph<-lapply(myfiles,function(x) zoo(x) )
print(graph)
Ideally I would like to also mark minimum and maximum, which I am aware can be done with ggplot but not zoo.
Thank you so much for your help in advance
Assuming that the problem is overlapping panel names, there are several solutions:
Abbreviate the names using abbreviate. We show this for plot.zoo and autoplot.zoo.
Put the panel name in the upper left. We show this for plot.zoo using a custom panel.
Use a header on each panel. We show this using xyplot.zoo and using ggplot.
The examples below use the test input in the Note at the end. (Next time please provide a complete example including all input in reproducible form.)
The first two examples below abbreviate the panel names, using plot.zoo and autoplot.zoo (which uses ggplot2). The third example uses xyplot.zoo (which uses lattice). This automatically uses headers and is probably the easiest solution.
library(zoo)
plot(z, ylab = abbreviate(names(z), 8))
library(ggplot2)
zz <- setNames(z, abbreviate(names(z), 8))
autoplot(zz)
library(lattice)
xyplot(z)
This fourth example puts the panel names in the upper left of the panel themselves using plot.zoo with a custom panel.
pnl <- function(x, y, ..., pf = parent.frame()) {
legend("topleft", names(z)[pf$panel.number], bty = "n", inset = -0.1)
lines(x, y)
}
plot(z, panel = pnl, ylab = "")
We can also get headers with autoplot.zoo similar to in lattice above.
library(ggplot2)
autoplot(z, facets = ~ Series, col = I("black")) +
theme(legend.position = "none")
List
If you have a list of vectors L (see Note at end for a reproducible example of such a list) then this will produce a zoo object:
do.call("merge", lapply(L, zoo))
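For the case in the question, where several data frames sit in a list and the minimum and maximum should be marked, a custom panel can be reused across the list. This is a rough sketch only: `dfs`, `extremes` and `pnl_minmax` are made-up names, and the data are random stand-ins for the real files.

```r
library(zoo)

set.seed(42)
# Hypothetical stand-in for the list of data frames (random data, 101 rows each)
dfs <- list(walk1 = data.frame(a = rnorm(101), b = rnorm(101)),
            walk2 = data.frame(a = rnorm(101), b = rnorm(101)))

# Positions of the minimum and maximum of one series
extremes <- function(y) c(min = which.min(y), max = which.max(y))

# Panel function: draw the series, then mark its minimum (blue) and maximum (red)
pnl_minmax <- function(x, y, ...) {
  lines(x, y, ...)
  i <- extremes(y)
  points(x[i], y[i], col = c("blue", "red"), pch = 19)
}

# One multi-panel plot per data frame, titled with the list name
for (nm in names(dfs)) {
  plot(zoo(dfs[[nm]]), panel = pnl_minmax, main = nm)
}
```

A for loop is used instead of lapply because the plots are drawn for their side effect and nothing needs to be collected.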
Note
Test input used above.
library(zoo)
set.seed(123)
nms <- paste0(head(state.name, 9), "XYZ") # long names
m <- matrix(rnorm(101*9), 101, dimnames = list(NULL, nms))
z <- zoo(m)
L <- split(m, col(m)) # test list using m in Note

Remove outliers by condition from list of data frames

I try to create a function to remove multiple outliers via cooks distance from a list of data frames.
There are some problems at the moment:
Can I formulate part 1 as function? I tried several things that did not work out. I want to use several different variables for the lm - so it would be great if I could use colnumbers and the regular expression syntax of data frames as input argument.
Part 2 - the filename of the plots are not correct. It takes the first observation in each data frame from the list as filename. How can I correct this?
Part 3: data frames without the outliers are not created. Function comes to an end after the message is printed. I can't find my mistake.
data(iris)
iris.lst <- split(iris[, 1:2], iris$Species)
new_names <- c(paste0(unlist(levels(iris$Species)),"_data"))
for (i in 1:length(iris.lst)) {
assign(new_names[i], iris.lst[[i]])
}
# Part 1: Then cooks distances
fit <- lapply(mget(ls(pattern = "_data")),
function(x) lm(x[,1] ~ x[,3], data = x))
cooksd <-lapply(fit,cooks.distance)
# Part 2: Plot each data frame with suspected outlier
plots <- function(x){
jpeg(file=paste0(names(x),".jpeg")) # file names are numbers
#par(mfrow=c(2,1))
plot(x, pch="*", cex=2, main="Influential cases by Cooks distance") # plot cook's distance
abline(h = 3*mean(x, na.rm=T), col="red") # add cutoff line
text(x=1:length(x)+1, y=x, labels=ifelse(x > 3*mean(x, na.rm=T),
names(x),""), col="red")
dev.off()
}
myplots <- lapply(cooksd, plots)
# Part 3: give me new data frames without influential cases
show_influential_cases <- function(x){
# invisible(cooksd[["n_OG"]] <- lapply(cooksd, length)
influential <- lapply(x,function(x) names(x)[x > 3*mean(x, na.rm=T)])
test <- as.data.frame(unlist(influential))[,1]
test <- as.numeric(test)
}
tested <- show_influential_cases(result)
cleaned_data <- add_new[-tested,] # removing outliers by indexing
Could someone please help me to improve my code?
Many thanks,
Nadine
In general, it is not good practice to create multiple data frames in the global environment. Lists are usually the better option because they are easier to manage.
Part 1 -
You can combine multiple steps in one lapply function. Here in part 1 we apply lm and cooks.distance function together in the same lapply call.
master_data <- split(iris[, 1:2], iris$Species)
data <- lapply(master_data, function(x) {
cooks.distance(lm(Sepal.Length ~ Sepal.Width, data = x))
})
new_names <- paste0(levels(iris$Species),"_data")
names(data) <- new_names
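The question also asked whether Part 1 can be written as a function that takes column numbers. One possible sketch uses reformulate() to build the formula from column names. The function name `cooks_by_cols` and its arguments are hypothetical, and the example uses four iris columns so that more than one predictor is available.

```r
# Hypothetical helper: Cook's distances for lm(response ~ predictors),
# where the columns may be given as numbers or as names.
cooks_by_cols <- function(d, response, predictors) {
  if (is.numeric(response))   response   <- names(d)[response]
  if (is.numeric(predictors)) predictors <- names(d)[predictors]
  fml <- reformulate(predictors, response = response)  # e.g. Sepal.Length ~ Sepal.Width + Petal.Length
  cooks.distance(lm(fml, data = d))
}

master_data <- split(iris[, 1:4], iris$Species)
cooksd <- lapply(master_data, cooks_by_cols, response = 1, predictors = 2:3)
```

Because the formula is built from column names, the fitted model keeps readable term labels, unlike x[,1] ~ x[,2].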
Part 2 -
The function called by lapply does not have access to the names of the list. Pass them separately and use Map to call the plots function.
plots <- function(x, y){
jpeg(file=paste0(y,".jpeg"))
plot(x, pch="*", cex=2, main="Influential cases by Cooks distance")
abline(h = 3*mean(x, na.rm=T), col="red") # add cutoff line
text(x=1:length(x)+1, y=x, labels=ifelse(x > 3*mean(x, na.rm=TRUE), names(x), ""), col="red") # label influential points with their row names
dev.off()
}
Map(plots, data, names(data))
Part 3 -
It is not entirely clear to me how you want Part 3 to work, so for now the function below simply returns each data frame with its influential cases removed.
remove_influential_cases <- function(x, y){
inds <- x > 3*mean(x, na.rm=TRUE)
y[!inds, ]
}
result <- Map(remove_influential_cases, data, master_data)
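If the influential cases themselves are also wanted, the same logical index can be used twice. The following self-contained sketch assumes the same 3 * mean cutoff as above; the names `flag`, `cleaned` and `outliers` are chosen here for illustration.

```r
master_data <- split(iris[, 1:2], iris$Species)
cooksd <- lapply(master_data, function(d)
  cooks.distance(lm(Sepal.Length ~ Sepal.Width, data = d)))

# TRUE for observations whose Cook's distance exceeds 3 times the mean
flag <- lapply(cooksd, function(cd) cd > 3 * mean(cd, na.rm = TRUE))

# Split each data frame into kept rows and flagged rows
cleaned  <- Map(function(d, f) d[!f, , drop = FALSE], master_data, flag)
outliers <- Map(function(d, f) d[f, , drop = FALSE], master_data, flag)

# Sanity check: kept + flagged rows add up to the original row counts
sapply(cleaned, nrow) + sapply(outliers, nrow)
```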

How to use plot function to plot results of your own function?

I'm writing a short R package which contains a function. The function returns a list of vectors. I would like to use the plot function in order to plot by default a plot done with some of those vectors, add lines and add a new parameter.
As an example, if I use the survival package I can get the following:
library(survival)
data <- survfit(Surv(time, status == 2) ~ 1, data = pbc)
plot(data) # Plots the result of survfit
plot(data, conf.int = "none") # New parameter
In order to try to make a reproducible example:
f <- function(x, y){
b <- x^2
c <- y^2
d <- x+y
return(list(one = b, two = c, three = d))
}
dat <- f(3, 2)
So using plot(dat) I would like to get the same as plot(dat$one, dat$two). I would also like to add one more (new) parameter that could be set to TRUE/FALSE.
Is this possible?
I think you might be looking for classes. You can use the S3 system for this.
For your survival example, data has the class survfit (see class(data)). Then using plot(data) will look for a function called plot.survfit. That is actually a non-exported function in the survival package, at survival:::plot.survfit.
You can easily do the same for your package. For example, have a function that creates an object of class my_class, and then define a plotting method for that class:
f <- function(x, y){
b <- x^2
c <- y^2
d <- x+y
r <- list(one = b, two = c, three = d)
class(r) <- c('my_class', 'list') # this is the important bit; the most specific class goes first
r
}
plot.my_class <- function(x, ...) {
plot(x$one, x$two, ...) # pass further arguments on to plot()
}
Now your code should work:
dat <- f(3, 2)
plot(dat)
You can put anything in plot.my_class you want, including additional arguments, as long as your first argument is x and is the my_class object.
plot now calls plot.my_class, since dat is of class my_class.
You can also add other methods, e.g. for print.
There are many different plotting functions that can be called with plot for different classes, see methods(plot)
Also see Hadley's Advanced R book chapter on S3.
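To cover the extra TRUE/FALSE parameter asked for in the question, the method can take its own argument next to `...`. This is a sketch under assumptions: the argument name `show_three` and what it does (adding the third vector as a dashed line) are invented here purely as an example.

```r
f <- function(x, y) {
  r <- list(one = x^2, two = y^2, three = x + y)
  class(r) <- c("my_class", "list")
  r
}

# Plot method with one extra logical argument (hypothetical name: show_three).
# Anything else supplied by the caller is passed on to plot() via `...`.
plot.my_class <- function(x, show_three = FALSE, ...) {
  plot(x$one, x$two, ...)
  if (show_three) lines(x$one, x$three, lty = 2)
  invisible(x)
}

dat <- f(1:5, 5:1)
plot(dat)                                         # default plot
plot(dat, show_three = TRUE, main = "With three") # new parameter plus a standard one
```

Keeping `...` in the signature matches the plot generic and lets users still set main, col, xlab and similar arguments.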

Gantt plot in base r - modifying plot properties

I would like to ask a follow-up question related to the answer given in this post [Gantt style time line plot (in base R) ] on Gantt plots in base r. I feel like this is worth a new question as I think these plots have a broad appeal. I'm also hoping that a new question would attract more attention. I also feel like I need more space than the comments of that question to be specific.
The following code was given by @digEmAll. It takes a data frame with columns for a start time, an end time, and a grouping variable and turns it into a Gantt plot. I have modified @digEmAll's function very slightly so that the bars/segments in the Gantt plot are contiguous rather than separated by a gap. Here it is:
plotGantt <- function(data, res.col='resources',
start.col='start', end.col='end', res.colors=rainbow(30))
{
#slightly enlarge Y axis margin to make space for labels
op <- par('mar')
par(mar = op + c(0,1.2,0,0))
minval <- min(data[,start.col])
maxval <- max(data[,end.col])
res.colors <- rev(res.colors)
resources <- sort(unique(data[,res.col]),decreasing=T)
plot(c(minval,maxval),
c(0.5,length(resources)+0.5),
type='n', xlab='Duration',ylab=NA,yaxt='n' )
axis(side=2,at=1:length(resources),labels=resources,las=1)
for(i in 1:length(resources))
{
yTop <- i+0.5
yBottom <- i-0.5
subset <- data[data[,res.col] == resources[i],]
for(r in 1:nrow(subset))
{
color <- res.colors[((i-1)%%length(res.colors))+1]
start <- subset[r,start.col]
end <- subset[r,end.col]
rect(start,yBottom,end,yTop,col=color)
}
}
par(op) # reset the plotting margins
}
Here are some sample data. You will notice that I have four groups 1-4. However, not all dataframes have all four groups. Some only have two, some only have 3.
mydf1 <- data.frame(startyear=2000:2009, endyear=2001:2010, group=c(1,1,1,1,2,2,2,1,1,1))
mydf2 <- data.frame(startyear=2000:2009, endyear=2001:2010, group=c(1,1,2,2,3,4,3,2,1,1))
mydf3 <- data.frame(startyear=2000:2009, endyear=2001:2010, group=c(4,4,4,4,4,4,3,2,3,3))
mydf4 <- data.frame(startyear=2000:2009, endyear=2001:2010, group=c(1,1,1,2,3,3,3,2,1,1))
Here I run the above function, but specify four colors for plotting:
plotGantt(mydf1, res.col='group', start.col='startyear', end.col='endyear',
res.colors=c('red','orange','yellow','gray99'))
plotGantt(mydf2, res.col='group', start.col='startyear', end.col='endyear',
res.colors=c('red','orange','yellow','gray99'))
plotGantt(mydf3, res.col='group', start.col='startyear', end.col='endyear',
res.colors=c('red','orange','yellow','gray99'))
plotGantt(mydf4, res.col='group', start.col='startyear', end.col='endyear',
res.colors=c('red','orange','yellow','gray99'))
These are the plots:
What I would like to do is modify the function so that:
1) it will plot on the y-axis all four groups regardless of whether they actually appear in the data or not.
2) it uses the same color for each group in every plot, regardless of how many groups are present. As you can see, mydf2 has four groups and all four colors are plotted (1-red, 2-orange, 3-yellow, 4-gray). The same colors are matched to the same groups for mydf3, because it only contains groups 2, 3 and 4 and the colors are picked in reverse order. However, mydf1 and mydf4 have different colors plotted for each group, as they do not contain any group 4. Gray is still the first color chosen, but now it is used for the lowest occurring group (group 2 in mydf1 and group 3 in mydf4).
It appears to me that the main thing I need to work on is the vector 'resources' inside the function, and have that not just contain the unique groups but all. When I try manually overriding to make sure it contains all the groups, e.g. doing something as simple as resources <-as.factor(1:4) then I get an error:
'Error in rect(start, yBottom, end, yTop, col = color) : cannot mix zero-length and non-zero-length coordinates'
Presumably the for loop does not know how to plot data that do not exist for groups that don't exist.
I hope that this is a replicable/readable question and it's clear what I'm trying to do.
EDIT: I realize that to solve the color problem, I could just specify the colors for the 3 groups that exist in each of these sample dfs. However, my intention is to use this plot as an output to a function whereby it wouldn't be known ahead of time if all of the groups exist for a particular df.
I slightly modified your function to handle NA in the start and end dates:
plotGantt <- function(data, res.col='resources',
start.col='start', end.col='end', res.colors=rainbow(30))
{
#slightly enlarge Y axis margin to make space for labels
op <- par('mar')
par(mar = op + c(0,1.2,0,0))
minval <- min(data[,start.col],na.rm=T)
maxval <- max(data[,end.col],na.rm=T)
res.colors <- rev(res.colors)
resources <- sort(unique(data[,res.col]),decreasing=T)
plot(c(minval,maxval),
c(0.5,length(resources)+0.5),
type='n', xlab='Duration',ylab=NA,yaxt='n' )
axis(side=2,at=1:length(resources),labels=resources,las=1)
for(i in 1:length(resources))
{
yTop <- i+0.5
yBottom <- i-0.5
subset <- data[data[,res.col] == resources[i],]
for(r in 1:nrow(subset))
{
color <- res.colors[((i-1)%%length(res.colors))+1]
start <- subset[r,start.col]
end <- subset[r,end.col]
rect(start,yBottom,end,yTop,col=color)
}
}
par(mar=op) # reset the plotting margins
invisible()
}
This way, if you simply append all possible group values to your data (with NA dates), they will be shown on the y axis, e.g.:
mydf1 <- data.frame(startyear=2000:2009, endyear=2001:2010,
group=c(1,1,1,1,2,2,2,1,1,1))
# add all the group values you want to print with NA dates
mydf1 <- rbind(mydf1,data.frame(startyear=NA,endyear=NA,group=1:4))
plotGantt(mydf1, res.col='group', start.col='startyear', end.col='endyear',
res.colors=c('red','orange','yellow','gray99'))
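As for the colors: at the moment res.colors are applied, in order, to the sorted groups, so the first color in res.colors goes to the first (sorted) group, and so on. Because all four groups are now always present in the data, each group gets the same color in every plot.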
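An alternative that avoids appending NA rows is to pass the full set of groups to the function and look up both the row position and the color by group value. This is a sketch, not the original author's function: `plotGanttFixed` and its `res.levels` argument are hypothetical names, and rect() is called once in vectorized form instead of inside a loop.

```r
plotGanttFixed <- function(data, res.col, start.col, end.col,
                           res.levels, res.colors) {
  stopifnot(length(res.levels) == length(res.colors))
  key_levels <- as.character(res.levels)
  names(res.colors) <- key_levels                           # one fixed color per group
  ypos <- setNames(rev(seq_along(res.levels)), key_levels)  # first group on top

  op <- par(mar = par("mar") + c(0, 1.2, 0, 0))  # room for the y labels
  on.exit(par(op))                               # restore margins on exit

  plot(range(c(data[[start.col]], data[[end.col]]), na.rm = TRUE),
       c(0.5, length(res.levels) + 0.5),
       type = "n", xlab = "Duration", ylab = NA, yaxt = "n")
  axis(side = 2, at = ypos, labels = res.levels, las = 1)

  # Look up position and color by group value, so missing groups change nothing
  key <- as.character(data[[res.col]])
  rect(data[[start.col]], ypos[key] - 0.5,
       data[[end.col]],   ypos[key] + 0.5, col = res.colors[key])
  invisible(res.colors)
}

mydf1 <- data.frame(startyear = 2000:2009, endyear = 2001:2010,
                    group = c(1, 1, 1, 1, 2, 2, 2, 1, 1, 1))
cols <- plotGanttFixed(mydf1, "group", "startyear", "endyear",
                       res.levels = 1:4,
                       res.colors = c("red", "orange", "yellow", "gray99"))
```

With this lookup, group 2 is drawn in orange whether or not groups 3 and 4 occur in the data, and all four groups always appear on the y axis.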

Labeling outliers on boxplot in R

I would like to plot each column of a matrix as a boxplot and then label the outliers in each boxplot as the row name they belong to in the matrix. To use an example:
vv=matrix(c(1,2,3,4,8,15,30),nrow=7,ncol=4,byrow=F)
rownames(vv)=c("one","two","three","four","five","six","seven")
boxplot(vv)
I would like to label the outlier in each plot (in this case 30) as the row name it belongs to, so in this case 30 belongs to row 7. Is there an easy way to do this? I have seen similar questions to this asked but none seemed to have worked the way I want it to.
There is a simple way. Note that the B in Boxplot in the following lines is a capital letter.
library(car)
Boxplot(y ~ x, id.method="y")
Or alternatively, you could use the "Boxplot" function from the {car} package which labels outliers for you.
See the following link: https://CRAN.R-project.org/package=car
In the example given it's a bit boring, because the outliers are all in the same row, but here is the code:
bxpdat <- boxplot(vv)
text(bxpdat$group, # the x locations
bxpdat$out, # the y values
rownames(vv)[which(vv == bxpdat$out, arr.ind=TRUE)[, 1]], # the labels
pos = 4)
This picks the row names whose values equal the "out" list (i.e., the outliers) in the result of boxplot. boxplot computes these values with boxplot.stats and returns them. Take a look at:
str(bxpdat)
@DWin's solution works very well for a single boxplot, but will fail for anything with duplicate values, like the dataset I have created:
#Create data
set.seed(1)
basenums <- c(1,2,3,4,8,15,30)
vv=matrix(c(basenums, sample(basenums), 1-basenums,
c(0, 29, 30, 31, 32, 33, 60)),nrow=7,ncol=4,byrow=F)
dimnames(vv)=list(c("one","two","three","four","five","six","seven"), 1:4)
On this dataset, @DWin's solution gives wrong labels: in the 4th boxplot, the minimum and the maximum get the same row label, which is not possible.
This solution is monstrous (and I hope can be simplified), but effective.
#Reshape data
vv_dat <- as.data.frame(vv)
vv_dat$row <- row.names(vv_dat)
library(reshape2)
new_vv <- melt(vv_dat, id.vars="row")
#Get boxplot data
bxpdat <- as.data.frame(boxplot(value~variable, data=new_vv)[c("out", "group")])
#Get matches with boxplot data
text_guide <- do.call(rbind, apply(bxpdat, 1,
function(x) new_vv[new_vv$value==x[1]&new_vv$variable==x[2], ]))
#Add labels
with(text_guide, text(x=as.numeric(variable)+0.2, y=value, labels=row))
Or you can simply run the code from this blog post:
source("https://raw.githubusercontent.com/talgalili/R-code-snippets/master/boxplot.with.outlier.label.r") # Load the function
set.seed(6484)
y <- rnorm(20)
x1 <- sample(letters[1:2], 20,T)
lab_y <- sample(letters, 20)
# plot a boxplot with interactions:
boxplot.with.outlier.label(y~x1, lab_y)
(which handles multiple outliers which are close to one another)
@sebastian-c This is a slight modification of DWin's solution that seems to work more generally:
bx1<-boxplot(vv,las=2,cex.axis=.8)
if(length(bx1$out)!=0){
## get the row of each outlier
out.rows<-sapply(1:length(bx1$out),function(i) which(vv[,bx1$group[i]]==bx1$out[i]))
text(bx1$group,bx1$out,
rownames(vv)[out.rows],
pos=4
)
}
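Another way to sidestep the duplicate-value problem is to look for outliers one column at a time, so that a value is only ever matched against its own column. This is a minimal sketch using the original example matrix; `outlier_rows` is a made-up helper name, and it relies on boxplot.stats using the same default 1.5 * IQR rule as boxplot.

```r
vv <- matrix(c(1, 2, 3, 4, 8, 15, 30), nrow = 7, ncol = 4, byrow = FALSE)
rownames(vv) <- c("one", "two", "three", "four", "five", "six", "seven")

# For each column, the row indices of the values flagged as outliers
outlier_rows <- function(m) {
  lapply(seq_len(ncol(m)), function(j) {
    which(m[, j] %in% boxplot.stats(m[, j])$out)
  })
}

boxplot(vv)
rows <- outlier_rows(vv)
for (j in seq_along(rows)) {
  r <- rows[[j]]
  # Label only if this column actually has outliers
  if (length(r) > 0) text(j, vv[r, j], rownames(vv)[r], pos = 4)
}
```

In this example every column flags only the value 30, so each boxplot gets the label "seven".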
