I am trying to create a 'before and after' line chart which shows results of blood tests before and after an operation. I have 307 pairs of data so need to get the lines function to plot a line for each of the 307 columns in the matrix of data created from pre- and post-operative data (one column: one patient). So I tried this:
ylabel<-"Platelet count (millions/ml)"
preoptpk<-c(100,101,102,103,104,105)
postoptpk<-c(106,107,108,109,110,111)
preoptpk<-t(matrix(preoptpk))
postoptpk<-t(matrix(postoptpk))
preoptpk
postoptpk
beforeandafterdata<-rbind(preoptpk, postoptpk)
beforeandafterdata
ylimits<-c(0.8*min(beforeandafterdata,na.rm=TRUE),1.15*max(beforeandafterdata, na.rm=TRUE))
ylimits
plot(beforeandafterdata[,1], type = "l", col = "black", xlim = c(0.9, 2.1),
ylim = ylimits, ann = FALSE, axes = FALSE)
title(ylab=ylabel, cex.lab=1.4)
axis(1,at=1:2,lab=c("Preop.","Postop."),cex.axis=1.5)
axis(2,labels=TRUE)
x<-c(1*2:6)
x
lines(beforeandafterdata[,x],type="l",col="black",
xlim=c(0.9,2.1),ylim=ylimits,ann=FALSE)
..and nothing happened.
I don't understand why I can't use x<-c(1*2:307) since when I manually define x as 2 then 3 then 4 then 5 then 6 it works fine:
x <- 2
x
lines(beforeandafterdata[,x],type="l",col="black",xlim=c(0.9,2.1),ylim=ylimits,ann=FALSE)
x <- 3
x
lines(beforeandafterdata[,x],type="l",col="black",xlim=c(0.9,2.1),ylim=ylimits,ann=FALSE)
x <- 4
x
lines(beforeandafterdata[,x],type="l",col="black",xlim=c(0.9,2.1),ylim=ylimits,ann=FALSE)
x <- 5
x
lines(beforeandafterdata[,x],type="l",col="black",xlim=c(0.9,2.1),ylim=ylimits,ann=FALSE)
x <- 6
x
lines(beforeandafterdata[,x],type="l",col="black",xlim=c(0.9,2.1),ylim=ylimits,ann=FALSE)
x<-c(1*2:6)
How can I get this to work? I have several variables, and manually plotting 307 lines for each would be very time-consuming. Thanks for any replies.
Trying to stay close to your example, you need to use xy.coords within lines.
plot(beforeandafterdata[,1],type="l",col="black",xlim=c(0.9,2.1),ylim=ylimits,
ann=FALSE,axes=FALSE)
title(ylab=ylabel, cex.lab=1.4)
axis(1,at=1:2,lab=c("Preop.","Postop."),cex.axis=1.5)
axis(2,labels=TRUE)
x<-c(1*2:6)
x
# Draw one line per column; xlim, ylim and ann are set by plot() and are not arguments of lines()
invisible(lapply(x, function(j) {
lines(xy.coords(x = c(1, 2), y = beforeandafterdata[, j]), type = "l", col = "black")
}))
The lapply call draws each column as its own line, which prevents one line from being joined to the next.
You could use a for loop to do this. e.g.:
for (x in 2:6) {
lines(beforeandafterdata[,x], ...)
}
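To make the loop idea concrete, here is a minimal self-contained sketch. It rebuilds the small example matrix from the question and passes the x positions (1 and 2) explicitly to lines(), so each patient gets a separate segment. The loop variable j is only an illustrative name, and this is one possible way to fill in the elided arguments, not necessarily what the answerer had in mind.

```r
# Example data: row 1 = pre-op, row 2 = post-op, one column per patient
beforeandafterdata <- rbind(c(100, 101, 102, 103, 104, 105),
                            c(106, 107, 108, 109, 110, 111))
ylimits <- c(0.8 * min(beforeandafterdata, na.rm = TRUE),
             1.15 * max(beforeandafterdata, na.rm = TRUE))

# Set up an empty plot with the right axes
plot(NA, type = "n", xlim = c(0.9, 2.1), ylim = ylimits,
     xlab = "", ylab = "Platelet count (millions/ml)", axes = FALSE)
axis(1, at = 1:2, labels = c("Preop.", "Postop."))
axis(2)

# One lines() call per patient, with explicit x positions 1 and 2
for (j in seq_len(ncol(beforeandafterdata))) {
  lines(1:2, beforeandafterdata[, j], col = "black")
}
```

With 307 patients the same loop applies unchanged, since it runs over however many columns the matrix has.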
Or you can use the reshape2 and ggplot2 packages. First melt your data into a long format that ggplot2 likes:
library(reshape2)
beforeandafter_melted <- melt(beforeandafterdata)
Then plot away. You don't need the color argument, but the group is important to force individual lines to be drawn.
library(ggplot2)
ggplot(beforeandafter_melted, aes(x=Var1, y=value, color=factor(Var2), group=Var2)) +
geom_line()
Where Var1 is the row (1 or 2) and Var2 is the column (1 to 6) from your initial matrix beforeandafterdata.
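If it helps to see what the long format looks like, the sketch below builds the same three columns by hand in base R for a small matrix. The column names Var1, Var2 and value are chosen to mirror the ones used in the ggplot call above; this is my understanding of the layout for a matrix without dimnames, not a replacement for melt itself.

```r
# Small stand-in matrix: row 1 = pre-op, row 2 = post-op, 3 patients
m <- rbind(c(100, 101, 102),
           c(106, 107, 108))

# One row per matrix cell: row index, column index, and the cell value
long <- data.frame(
  Var1  = as.vector(row(m)),   # 1 = pre-op, 2 = post-op
  Var2  = as.vector(col(m)),   # patient (column) number
  value = as.vector(m)         # values read column by column
)
long
```

Each patient contributes two rows (one per time point), which is why group = Var2 is what ties the two points of a patient into one line.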
Also, why have you written x <- c(1*2:307)? This is no different than 2:307 (unless you're trying to force numeric conversion, but that isn't the way to go about it).
all.equal(c(1*2:307), 2:307)
# [1] TRUE
Related
I want to create a vector of functions with two parameters where one parameter is over a continuous range of values and the other runs over a fixed number of numerical values saved in the column vector dat[,2].
# Example functions
icc <- function(year, x) {
z = exp(year - x)
inf = z / (1 + z)
return (inf)
}
# Example data
year <- seq(-4, 4, 0.1)
x1 <- dat[1, 2]
x2 <- dat[2, 2]
# Plots
plot(year, icc(year, x1), type = "l")
plot(year, icc(year, x2), type = "l")
The issues are:
1) dat[,2] has more than just 2 values and I want to be able to plot all the corresponding functions on the same plot but with different colors
2) manually assigning colors to each line is difficult as there are a large number of lines
3) dat[,1] stores the corresponding label to each plot; would it be possible to add them over each line?
I have something like this in mind-
UPDATE: dat is simply a 40 x 2 table storing strings in the first column and numerical values in the second. By 'a vector of functions', I mean an array containing functions with parameter values unique to each row. For example, if t^i is the function, then element 1 of the array is the function t^1, element 2 is t^2, and so on, where t is a 'range'. (Labels and colors are extras and not too important. If unanswered, I'll post another question for them.)
The function to use is matplot, not plot. There is also matlines but if the data to be plotted is in a matrix, matplot can plot all columns in one call.
Create a matrix of y coordinates, yy, with one column per x value. This is done with a sapply loop. In the code below I have called the x values xx since there is no dat[,2] to work with.
Plot the resulting matrix in one matplot function call, which takes care of the colors automatically.
The line-labeling problem is not addressed, only the line-plotting problem. With so many lines, their labels would make the plot more difficult to read.
icc <- function(year, x) {
z = exp(year - x)
inf = z / (1 + z)
return (inf)
}
# Example data
year <- seq(-4, 4, 0.1)
xx <- seq(-1, 1, by = 0.2)
yy <- sapply(xx, \(x) icc(year, x))
matplot(year, yy, type = "l", lty = "solid")
Created on 2022-07-26 by the reprex package (v2.0.1)
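For the label question, one hedged option is a legend rather than text on each line. The sketch below assumes made-up labels (item1, item2, ...) standing in for dat[,1], which is not available here, and picks the colors explicitly so the legend can reuse them. The curve is computed with plogis(year - x), which is the same formula as icc.

```r
# Same curve as icc(), written with the base R logistic CDF
icc <- function(year, x) plogis(year - x)

year <- seq(-4, 4, 0.1)
xx   <- seq(-1, 1, by = 0.2)
# Hypothetical labels standing in for dat[, 1]
labs <- paste0("item", seq_along(xx))

yy   <- sapply(xx, function(x) icc(year, x))
cols <- rainbow(length(xx))   # one color per curve

matplot(year, yy, type = "l", lty = "solid", col = cols)
legend("topleft", legend = labs, col = cols, lty = "solid", cex = 0.7)
```

With 40 rows the legend gets long, so it may be worth splitting it into columns with the ncol argument of legend.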
Note
Function icc is the logistic distribution CDF with location x and scale 1. The base R plogis function can substitute for it, the results are equal within floating-point precision.
icc2 <- function(year, x) plogis(year, location = x, scale = 1)
yy2 <- sapply(xx, \(x) icc2(year, x))
identical(yy, yy2)
#> [1] FALSE
all.equal(yy, yy2)
#> [1] TRUE
I am learning R and I am trying to create histograms and boxplots for every column in a dataframe which has 80+ columns. The plots will be grouped based on the value of a column named "cluster".
Since this task is quite cumbersome and the names of the columns are not user-friendly, and given that more tasks of this kind will come in the future, I was thinking to find a way to automate the process.
So I came up with the idea of writing a function that calls ggplot's histogram and boxplot geoms (geom_histogram() and geom_boxplot()) and creates two ggplot objects, p1 and p2, which it stores in a list and returns. I would then loop through the columns of the dataframe, apply the function to each one, and store the results in a list called plots_all. Finally, I would extract the ggplot objects (histograms and boxplots) one at a time and print them.
However, I have difficulty implementing this idea. Perhaps, there are other ways more efficient to perform the same task. In any case, I would appreciate your help.
More specifically, I cannot get the column means to appear by group in the histogram when I use the function, although they do appear when I write the command myself. Second, I cannot pass the name of the column to the function and use it to label the graph. Third, I have difficulty extracting exactly the plot I want from the list (I get both plots simultaneously). Of course, I could write two functions, each dedicated to a single type of plot, but I am still curious why my method is not working as I expected. So now, let's dive in!
Let me begin by giving some background information:
A glimpse on the data:
head(df_all[, c(1:2, 48)])
  TOTAL_Estimated_Collateral_value_sum TOTAL_CREDIT_BUREAU_RATING_max cluster
1                          -0.17499342                    -0.37721374       1
2                          -0.86443362                    -0.50003823       1
3                           0.22211949                    -0.49997598       2
4                           0.01007717                    -0.07512348       1
5                          -0.77617685                    -0.49997598       2
6                          -1.43518056                    -0.42273492       1
> table(df_all$cluster)
    1     2     3
24342  8565  1350
The code I am using is the following:
plots <- function(w, n, df_all){
# This function takes three arguments: w is a column of the dataframe, n is the name of that column and df_all is the dataframe from which the column originates
mu <- ddply(df_all, "cluster", summarise, grp.mean=mean(w, na.rm = TRUE)) #calculate the mean of the column
# Creates a histogram using the column w. The name of the column n is used in the labs() to define axes labels and
# graph title. Stores the histogram object to a variable p1.
p1 <- ggplot(df_all, aes(x = w, fill= cluster)) +
geom_histogram(alpha = 0.7, position="dodge")+
geom_vline(data=mu, aes(xintercept=grp.mean, color= cluster),
linetype="dashed", size = 2)+
labs(title = n, x = "Cluster", y = n)
# Creates a boxplot using the column w. The name of the column n is used in the labs() to define axes labels and
# graph title. Stores the boxplot object to a variable p2.
p2 <- ggplot(df_all, aes(x=cluster, y=w, fill=cluster)) +
geom_boxplot() + labs(title = n, x = "Cluster", y = n)
plot <- list() # Initiates an empty list
plot[[1]] <- p1 #Appends the object p1 to the list
plot[[2]] <- p2 #Appends the object p2 to the list
plot # Returns the list
}
plots_all <- list() # initiates
for (i in 1 : 38){ # Loops over a selection of the indices of the columns of df_all
n <- names(df_all[,i]) # Extracts the name of the column at index i and stores it to variable n
w <- df_all[,i] # Extracts the df_all column at index i and stores it to a vector w
plots_all[[i]] <- plots(w, n, df_all) #Call the plots() function with the appropriate arguments and stores the
# returned list to a list plots_all
}
To get the plots for the first column I write:
plots_all[[1]]
This will plot both plots --histogram and boxplot-- at one stroke. So I am not given an opportunity to select which of the two to display.
Moreover, I get a histogram that looks like this:
As you can see this histogram does not display the means of the three groups as vertical lines but only one mean.
However using the following code, I can get the 3 group means appearing:
mu <- ddply(df_all, "cluster", summarise, grp.mean=mean(TOTAL_Estimated_Collateral_value_sum, na.rm = TRUE))
ggplot(df_all, aes(x=TOTAL_Estimated_Collateral_value_sum, fill= cluster)) +
geom_histogram(alpha = 0.7, position="dodge")+
geom_vline(data=mu, aes(xintercept=grp.mean, color= cluster),
linetype="dashed", size = 2)+
theme(legend.position="top")
You can inspect the output here:
So, 3 questions:
1) Why do the group means not appear as separate vertical lines when I use the function? What should I change?
2) How can I pass to the ggplot labs() function the information I want (a string that is a function of the name of the column I am passing to the function) when I use the function, to label the axes and title the graph appropriately? Should I use paste() in some way, and if so, how?
3) How can I control which plot is printed (histogram vs. boxplot)?
Your advice will be appreciated.
I did the following change in the code of the function:
plot <- list(list(p1), list(p2), list(mu))
Then, I can get the elements separately using the following slicing syntax:
plots_all <- plots(w, n, df)
plots_all[[1]][[2]] # Returns the boxplot (second plot) of the first variable
I still have not found how to get the means right and how to use the name of the variable passed to the function in defining plot labels and title.
In relation to the mean problem (mu), mu is returned with the same value for all levels of cluster, which is basically zero, which implies that it aggregates over the entire range ignoring 'cluster' (the variable is standardized). I.e.:
cluster grp.mean
1 1 -3.542677e-17
2 2 -3.542677e-17
3 3 -3.542677e-17
However, when I run the command outside of the function it returns the right answer:
> mu <- ddply(df_all, "cluster", summarise, grp.mean=mean(TOTAL_Estimated_Collateral_value_sum, na.rm = TRUE))
> mu
cluster grp.mean
1 1 -0.042860846
2 2 0.120947850
3 3 0.005481753
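The identical means are consistent with w being a free-standing vector: it is not a column of the data frame being split, so mean(w) sees every row each time. A small base R sketch of the difference follows, using made-up data (the columns score and cluster are hypothetical) and looking the column up by name, so the same string can also be used for labels. This only illustrates the mechanism and is not a drop-in fix for the ddply call.

```r
set.seed(42)
# Hypothetical data: two clusters with clearly different means
df <- data.frame(
  score   = c(rnorm(50, mean = 0), rnorm(50, mean = 2)),
  cluster = factor(rep(1:2, each = 50))
)

n <- names(df)[1]   # column name as a string, usable in plot titles
w <- df[[n]]        # the full column as a plain vector

overall_mean <- mean(w, na.rm = TRUE)   # a single value, ignores cluster

# Grouping the column together with cluster gives one mean per group
group_means <- tapply(df[[n]], df$cluster, mean, na.rm = TRUE)
mu <- data.frame(cluster = names(group_means),
                 grp.mean = as.vector(group_means))
mu
```

Note that names(df)[i] returns the column name, whereas names(df[, i]) on a single extracted column returns NULL, which may be why the labels were empty.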
Is it possible to plot pairs of columns in a single plot with a loop? For example, if I have a data frame of time series with 10 columns (x1, x2.. x10), I would like to create 5 plots: 1st plot will display x1 and x2, the 2nd plot would display x3 and x4 and so on.
Any plotting method would be useful, (zoo, lattice, ggplot2).
I got stuck at creating a loop to plot a single variable:
set.seed(1)
x<- data.frame(replicate(10,rnorm(10, mean = 0, sd = 1)))
cols <- seq(1,10)
library(zoo)
z <- read.zoo(x)
for (i in cols) {
plot(z[,i], screen = 1)
}
Thanks in advance.
How about this with ggplot2 and reshape2:
require(reshape2)
require(ggplot2)
m<-melt(matrix(z,10))
m$facet<-cut(m$Var2,c(0,2,4,6,8,10))
ggplot(m)+geom_line(aes(x=Var1,y=value,group=Var2,color=factor(Var2)))+facet_wrap(~ facet)
This can be done in a single line without a loop. The col argument makes the odd-numbered series black and the even-numbered series red. Note that z in the question has 9 columns (since the first column of x is used as the time index), so we have used a 10-column z below instead, which was likely what was intended.
library(zoo)
# test data
set.seed(123); z <- zoo(matrix(rnorm(250), 25)); colnames(z) <- make.names(1:10)
plot(z, screen = rep(colnames(z)[c(TRUE, FALSE)], each = 2), col = 1:2)
The output is shown below. To produce a single column add the argument nc=1 or to produce a lattice plot replace plot with xyplot.
ADDED: lattice solution.
Like this? I am not clear on how you want to plot it, though.
par(mfrow=c(1,5))
for (i in seq(1,10,by=2)){
plot(x[,i],x[,i+1])
}
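If the goal is two time series per panel rather than one column plotted against the other, a base R variant of the same loop could look like the sketch below. It assumes the row number serves as the time index and uses matplot to draw each pair of columns; the 3 x 2 panel layout is an arbitrary choice.

```r
set.seed(1)
x <- data.frame(replicate(10, rnorm(10, mean = 0, sd = 1)))

# First column of each pair: 1, 3, 5, 7, 9
starts <- seq(1, ncol(x) - 1, by = 2)

op <- par(mfrow = c(3, 2), mar = c(2, 4, 2, 1))
for (i in starts) {
  # Two series per panel: column i in black, column i + 1 in red
  matplot(as.matrix(x[, c(i, i + 1)]), type = "l", lty = 1, col = 1:2,
          ylab = "value",
          main = paste(names(x)[i], "and", names(x)[i + 1]))
}
par(op)   # restore the previous graphics settings
```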
I have data in R with overlapping points.
x = c(4,4,4,7,3,7,3,8,6,8,9,1,1,1,8)
y = c(5,5,5,2,1,2,5,2,2,2,3,5,5,5,2)
plot(x,y)
How can I plot these points so that the points that are overlapped are proportionally larger than the points that are not. For example, if 3 points lie at (4,5), then the dot at position (4,5) should be three times as large as a dot with only one point.
Here's one way using ggplot2:
library(ggplot2)
x = c(4,4,4,7,3,7,3,8,6,8,9,1,1,1,8)
y = c(5,5,5,2,1,2,5,2,2,2,3,5,5,5,2)
df <- data.frame(x = x,y = y)
ggplot(data = df,aes(x = x,y = y)) + stat_sum()
Depending on the ggplot2 version, stat_sum sizes points by the proportion or by the count of instances by default. You can request raw counts explicitly by doing something like:
ggplot(data = df,aes(x = x,y = y)) + stat_sum(aes(size = after_stat(n)))
Here's a simpler (I think) solution:
x <- c(4,4,4,7,3,7,3,8,6,8,9,1,1,1,8)
y <- c(5,5,5,2,1,2,5,2,2,2,3,5,5,5,2)
size <- sapply(1:length(x), function(i) { sum(x==x[i] & y==y[i]) })
plot(x,y, cex=size)
## Tabulate the number of occurrences of each coordinate
df <- data.frame(x, y)
## aggregate() keeps each count on the same row as its (x, y) pair
df2 <- aggregate(value ~ x + y, data = transform(df, value = 1), FUN = sum)
## Use cex to set point size to some function of coordinate count
## (By using sqrt(value), the _area_ of each point will be proportional
## to the number of observations it represents)
plot(y ~ x, cex = sqrt(value), data = df2, pch = 16)
You didn't really ask for this approach but alpha may be another way to address this:
library(ggplot2)
ggplot(data.frame(x=x, y=y), aes(x, y)) + geom_point(alpha=.3, size = 3)
You need to add the parameter cex to your plot function. First what I would do is use the function as.data.frame and table to reduce your data to unique (x,y) pairs and their frequencies:
new.data = as.data.frame(table(x,y))
new.data = new.data[new.data$Freq != 0,] # Remove points with zero frequency
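The only downside to this is that it converts numeric data to factors. So convert back to numeric via as.character (calling as.numeric directly on a factor returns the level codes, not the original values), and plot!
plot(as.numeric(as.character(new.data$x)), as.numeric(as.character(new.data$y)), cex = new.data$Freq)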
You may also want to try sunflowerplot.
sunflowerplot(x,y)
Let me propose alternatives to adjusting the size of the points. One of the drawbacks of using size (radius? area?) is that the reader's evaluation of spot size vs. the underlying numeric value is subjective.
So, option 1: plot each point with transparency --- ninja'd by Tyler!
option 2: use jitter to push your data around slightly so the plotted points don't overlap.
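A minimal sketch of the jitter option, using the data from the question. The amount of 0.15 is an arbitrary choice that keeps the points close to their true positions while separating the duplicates.

```r
x <- c(4,4,4,7,3,7,3,8,6,8,9,1,1,1,8)
y <- c(5,5,5,2,1,2,5,2,2,2,3,5,5,5,2)

set.seed(1)   # make the random offsets reproducible
jx <- jitter(x, amount = 0.15)   # each value moved by at most 0.15
jy <- jitter(y, amount = 0.15)

plot(jx, jy, pch = 16, xlab = "x", ylab = "y")
```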
A solution using lattice and table (similar to #R_User's, but with no need to remove the zero counts, since lattice does the job):
library(lattice)
dt <- as.data.frame(table(x,y))
xyplot(dt$y~dt$x, cex = dt$Freq^2, col =dt$Freq)
I have a data frame with a quantitative variable, x, and several different factors, f1, f2, ...,fn. The number of levels is not constant across factors.
I want to create a (single) plot of densities of x by factor level fi.
I know how to hand code this for a specific factor. For example, here is the plot for a factor with two levels.
# set up the background plot
plot(density(frame$x[frame$f1=="level1"]))
# add curves
lines(density(frame$x[frame$f1=="level2"]))
I could also do this like so:
# set up the background plot
plot(NA)
# add curves
lines(density(frame$x[frame$f1=="level1"]))
lines(density(frame$x[frame$f1=="level2"]))
What I'd like to know is how can I do this if I only specify the factor as input. I don't even know how to write a for loop that would do what I need, and I have the feeling that the 'R way' would avoid for loops.
Bonus: For the plots, I would like to specify limiting values for the axes. Right now I do this in this way:
xmin=min(frame$x[frame$f1=="level1"],frame$x[frame$f1=="level2"])
How can I include this type of calculation in my script?
I'm assuming your data is in the format (data frame called df)
f1   f2   f3   ...   fn   value
A    ...                  value 1
A    ...                  value 2
...
B    ...                  value n-1
B    ...                  value n
In that case, lattice (or ggplot2) will be very useful.
library(lattice)
densityplot(~value, groups = f1, data = df, plot.points = FALSE)
This should get you close to what you are looking for, I think.
Greg
You could also do:
# create an empty plot. You may want to add xlab, ylab etc
# EDIT: also add some appropriate axis limits with xlim and ylim
plot(0, 0, type = "n", xlim=c(0, 10), ylim=c(0, 2))
levels <- unique(frame$f1)
for (l in levels)
{
lines(density(frame$x[frame$f1==l]))
}
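For the bonus question about axis limits, one way is to compute all the densities first and take the limits from them before plotting. The sketch below uses a made-up frame with columns x and f1 as a stand-in for the real data. It takes only the factor as input through split(), so it should carry over to factors with any number of levels.

```r
set.seed(1)
# Hypothetical stand-in for the question's data frame
frame <- data.frame(
  x  = c(rnorm(100, mean = 0), rnorm(100, mean = 3)),
  f1 = factor(rep(c("level1", "level2"), each = 100))
)

# One density estimate per factor level
dens <- lapply(split(frame$x, frame$f1), density)

# Axis limits that cover every curve
xlim <- range(sapply(dens, function(d) range(d$x)))
ylim <- c(0, max(sapply(dens, function(d) max(d$y))))

plot(NA, type = "n", xlim = xlim, ylim = ylim, xlab = "x", ylab = "Density")
for (k in seq_along(dens)) {
  lines(dens[[k]], col = k)   # one color per level
}
legend("topright", legend = names(dens), col = seq_along(dens), lty = 1)
```

To use another factor, replace frame$f1 in the split() call; nothing else depends on the number of levels.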
ggplot2 code
library(ggplot2)
ggplot(data, aes(value, colour = f1)) +
stat_density(position = "identity")