I have a for loop from which I call a function grapher() which extracts certain columns from a dataframe (position and w, both continuous variables) and plots them. My code changes the Y variable (called w here) each time it runs and so I'd like to plot it as an overlay progressively. If I run the grapher() function 4 times for example, I'd like to have 4 plots where the first plot has only 1 line, and the 4th has all 4 overlain on each other (as different colours).
I've already tried points() as suggested in other posts, but for some reason it only generates a new graph.
grapher <- function(){
position.2L <- data[data$V1=='2L', 'V2']
w.2L <- data[data$V1=='2L', 'w']
plot(position.2L, w.2L)
points(position.2L, w.2L, col='green')
}
# example of my for loop #
for (t in 1:200){
#code here changes the 'w' variable each iteration of 't'
if (t%%50==0){
grapher()
}
}
Not knowing any details about your situation I can only assume something like this might be applicable.
# Example data set
d <- data.frame(V1=rep(1:2, each=6), V2=rep(1:6, 2), w=rep(1:6, each=2))
# Prepare the matrix we will write to.
n <- 200
m <- matrix(d$w, nrow(d), n)
# Loop progressively adding more noise to the data
set.seed(1)
for (i in 2:n) {
m[,i] <- m[,i-1] + rnorm(nrow(d), 0, 0.05)
}
# We can now plot the matrix, selecting the relevant rows and columns
matplot(m[d$V1 == 1, seq(1, n, by=50)], type="o", pch=16, lty=1)
Related
I want to produce a loss exceedance curve with the code:
x <- rpois (10000, 3) * rlnorm (10000, 100,10)
number_sequence = seq(0.1,1,0.1)
for (i in number_sequence)
{
y <- quantile (log(x), c(i))
plot (y,i)
}
I am getting 10 separated plot charts. The information is right, but I need to have the 10 data points in the same chart to make a curve as (10%, 85), (20%, 91).. (100%, 141) . I failed to make a matrix to plot at the end.
I am happy to technically explain how the formulas work for risk quantification in R.
I have gotten 10 separated plot charts rather than one single chart with the 10 data sets.
y <- c()
for (i in number_sequence)
{
y <- c(y,quantile (log(x), c(i)))
}
plot(y,number_sequence)
Or faster:
y <- quantile(log(x),number_sequence)
plot(y,number_sequence)
I try to create a function to remove multiple outliers via cooks distance from a list of data frames.
There are some problems at the moment:
Can I formulate part 1 as function? I tried several things that did not work out. I want to use several different variables for the lm - so it would be great if I could use colnumbers and the regular expression syntax of data frames as input argument.
Part 2 - the filename of the plots are not correct. It takes the first observation in each data frame from the list as filename. How can I correct this?
Part 3: data frames without the outliers are not created. Function comes to an end after the message is printed. I can't find my mistake.
data(iris)
iris.lst <- split(iris[, 1:2], iris$Species)
new_names <- c(paste0(unlist(levels(iris$Species)),"_data"))
for (i in 1:length(iris.lst)) {
assign(new_names[i], iris.lst[[i]])
}
# Part 1: Then cooks distances
fit <- lapply(mget(ls(pattern = "_data")),
function(x) lm(x[,1] ~ x[,3], data = x))
cooksd <-lapply(fit,cooks.distance)
# Part 2: Plot each data frame with suspected outlier
plots <- function(x){
jpeg(file=paste0(names(x),".jpeg")) # file names are numbers
#par(mfrow=c(2,1))
plot(x, pch="*", cex=2, main="Influential cases by Cooks distance") # plot cook's distance
abline(h = 3*mean(x, na.rm=T), col="red") # add cutoff line
text(x=1:length(x)+1, y=x, labels=ifelse(x > 3*mean(x, na.rm=T),
names(x),""), col="red")
dev.off()
}
myplots <- lapply(cooksd, plots)
# Part 3: give me new data frames without influential cases
show_influential_cases <- function(x){
# invisible(cooksd[["n_OG"]] <- lapply(cooksd, length)
influential <- lapply(x,function(x) names(x)[x > 3*mean(x, na.rm=T)])
test <- as.data.frame(unlist(influential))[,1]
test <- as.numeric(test)
}
tested <- show_influential_cases(result)
cleaned_data <- add_new[-tested,] # removing outliers by indexing
Could someone please help me to improve my code?
Many thanks,
Nadine
In general, it is not a good practice to create multiple dataframes in global environment. Lists always are a better option, they are easy to manage.
Part 1 -
You can combine multiple steps in one lapply function. Here in part 1 we apply lm and cooks.distance function together in the same lapply call.
master_data <- split(iris[, 1:2], iris$Species)
data <- lapply(master_data, function(x) {
cooks.distance(lm(Sepal.Length ~ Sepal.Width, data = x))
})
new_names <- paste0(levels(iris$Species),"_data")
names(data) <- new_names
Part 2 -
lapply does not have access to names of the list, pass them separately and use Map to call plots function.
plots <- function(x, y){
jpeg(file=paste0(y,".jpeg"))
plot(x, pch="*", cex=2, main="Influential cases by Cooks distance")
abline(h = 3*mean(x, na.rm=T), col="red") # add cutoff line
text(x=1:length(x)+1,y=x,labels=ifelse(x > 3*mean(x, na.rm=T),y,""), col="red")
dev.off()
}
Map(plots, data, names(data))
Part 3 -
I am not exactly clear about how you want to perform Part3 but for now I am showing outlier and data separately.
remove_influential_cases <- function(x, y){
inds <- x > 3*mean(x, na.rm=TRUE)
y[!inds, ]
}
result <- Map(remove_influential_cases, data, master_data)
I want to plot as below. I tried to search several packages and plot functions but I couldn't find a solution.
My data has four columns.
ID F R M
1 2 3 4
2 4 6 7
...
I want to see the relationship between M and R with respect to each F value (1, 2, 3, ...). So, I'd like F along the x-axis, R along the y-axis, and M as the z-axis as in the below graph.
Thanks.
You can do this kind of thing with lattice cloud plots, using panel.3dpolygon from latticeExtra.
library(latticeExtra)
# generating random data
d <- data.frame(x=rep(1:40, 7), y=rep(1:7, each=40),
z=c(sapply(1:7, function(x) runif(40, 10*x, 10*x+20))))
# define the panel function
f <- function(x, y, z, groups, subscripts, ...) {
colorz <- c('#8dd3c7', '#ffffb3', '#bebada', '#fb8072', '#80b1d3',
'#fdb462', '#b3de69')
sapply(sort(unique(groups), decreasing=TRUE), function(i) {
zz <- z[subscripts][groups==i]
yy <- y[subscripts][groups==i]
xx <- x[subscripts][groups==i]
panel.3dpolygon(c(xx, rev(xx)), c(yy, yy),
c(zz, rep(-0.5, length(zz))),
col=colorz[i], ...)
})
}
# plot
cloud(z~x+y, d, groups=y, panel.3d.cloud=f, scales=list(arrows=FALSE))
I'm sure I don't need to loop over groups in the panel function, but I always forget the correct incantation for subscripts and groups to work as intended.
As others have mentioned in comments, this type of plot might look snazzy, but can obscure data.
Here is a minimal example of the type of data I'm
strugling to plot:
These curves are drawn from two processes.
library("lattice")
x0<-matrix(NA,350,76)
for(i in 1:150) x0[i,]<-arima.sim(list(order=c(1,0,0),ar=0.9),n=76)
for(i in 151:350) x0[i,]<-arima.sim(list(order=c(1,0,0),ar=-0.9),n=76)
I'd like to plot them as line plots in a lattice made of two boxes. The box located above
would contain the first 150 curves (in orange) and the box below should display
the next 200 curves (which should be in blue). I don't need a
label or legend. I've tried to use the example shown on the man-page:
aa<-t(x0)
colnames(aa)<-c(paste0("v0_",1:150),paste0("v1_",1:200))
aa<-as.ts(aa)
xyplot(aa,screens=list(v0_="0","1"),col=list(v0_="orange",v1_="blue"),auto.key=FALSE)
but somehow it doesn't work.
This will do without additional factors (yet agstudy's solution is not so much of a hack like this one):
# This is equivalent to your for-loops, use whatever you prefer
x0 <- do.call(rbind, lapply(1:350, function(i) {
arima.sim(list(order=c(1,0,0), ar=ifelse(i <= 150, 0.9, -0.9)), n=76)
}))
plotStuff <- function(indices, ...) {
plot.new()
plot.window(xlim=c(1, ncol(x0)), ylim=range(x0[indices,]))
box()
for (i in indices)
lines(x0[i,], ...)
}
par(mfrow=c(2,1), mar=rep(1,4)) # two rows, reduced margin
plotStuff(1:150, col="orange")
plotStuff(151:350, col="blue")
You should put your data in the long format like this:
Var1 Var2 value group
1 1 v0_1 2.0696016 v0
2 2 v0_1 1.3954414 v0
..... ..........
26599 75 v1_200 0.3488131 v1
26600 76 v1_200 0.2957114 v1
For example using reshape2 :
library(reshape2)
aa.m <- melt(aa)
aa.m$group <- gsub('^(v[0-9])_(.*)','\\1',aa.m$Var2)
xyplot(value~Var1|group,data=aa.m,type='l',groups=group)
Anyone knows how to take advantage of ggplot or lattice in doing survival analysis? It would be nice to do a trellis or facet-like survival graphs.
So in the end I played around and sort of found a solution for a Kaplan-Meier plot. I apologize for the messy code in taking the list elements into a dataframe, but I couldnt figure out another way.
Note: It only works with two levels of strata. If anyone know how I can use x<-length(stratum) to do this please let me know (in Stata I could append to a macro-unsure how this works in R).
ggkm<-function(time,event,stratum) {
m2s<-Surv(time,as.numeric(event))
fit <- survfit(m2s ~ stratum)
f$time <- fit$time
f$surv <- fit$surv
f$strata <- c(rep(names(fit$strata[1]),fit$strata[1]),
rep(names(fit$strata[2]),fit$strata[2]))
f$upper <- fit$upper
f$lower <- fit$lower
r <- ggplot (f, aes(x=time, y=surv, fill=strata, group=strata))
+geom_line()+geom_ribbon(aes(ymin=lower,ymax=upper),alpha=0.3)
return(r)
}
I have been using the following code in lattice. The first function draws KM-curves for one group and would typically be used as the panel.group function, while the second adds the log-rank test p-value for the entire panel:
km.panel <- function(x,y,type,mark.time=T,...){
na.part <- is.na(x)|is.na(y)
x <- x[!na.part]
y <- y[!na.part]
if (length(x)==0) return()
fit <- survfit(Surv(x,y)~1)
if (mark.time){
cens <- which(fit$time %in% x[y==0])
panel.xyplot(fit$time[cens], fit$surv[cens], type="p",...)
}
panel.xyplot(c(0,fit$time), c(1,fit$surv),type="s",...)
}
logrank.panel <- function(x,y,subscripts,groups,...){
lr <- survdiff(Surv(x,y)~groups[subscripts])
otmp <- lr$obs
etmp <- lr$exp
df <- (sum(1 * (etmp > 0))) - 1
p <- 1 - pchisq(lr$chisq, df)
p.text <- paste("p=", signif(p, 2))
grid.text(p.text, 0.95, 0.05, just=c("right","bottom"))
panel.superpose(x=x,y=y,subscripts=subscripts,groups=groups,...)
}
The censoring indicator has to be 0-1 for this code to work. The usage would be along the following lines:
library(survival)
library(lattice)
library(grid)
data(colon) #built-in example data set
xyplot(status~time, data=colon, groups=rx, panel.groups=km.panel, panel=logrank.panel)
If you just use 'panel=panel.superpose' then you won't get the p-value.
I started out following almost exactly the approach you use in your updated answer. But the thing that's irritating about the survfit is that it only marks the changes, not each tick - e.g., it will give you 0 - 100%, 3 - 88% instead of 0 - 100%, 1 - 100%, 2 - 100%, 3 - 88%. If you feed that into ggplot, your lines will slope from 0 to 3, rather than remaining flat and dropping straight down at 3. That might be fine depending on your application and assumptions, but it's not the classic KM plot. This is how I handled the varying numbers of strata:
groupvec <- c()
for(i in seq_along(x$strata)){
groupvec <- append(groupvec, rep(x = names(x$strata[i]), times = x$strata[i]))
}
f$strata <- groupvec
For what it's worth, this is how I ended up doing it - but this isn't really a KM plot, either, because I'm not calculating out the KM estimate per se (although I have no censoring, so this is equivalent... I believe).
survcurv <- function(surv.time, group = NA) {
#Must be able to coerce surv.time and group to vectors
if(!is.vector(as.vector(surv.time)) | !is.vector(as.vector(group))) {stop("surv.time and group must be coercible to vectors.")}
#Make sure that the surv.time is numeric
if(!is.numeric(surv.time)) {stop("Survival times must be numeric.")}
#Group can be just about anything, but must be the same length as surv.time
if(length(surv.time) != length(group)) {stop("The vectors passed to the surv.time and group arguments must be of equal length.")}
#What is the maximum number of ticks recorded?
max.time <- max(surv.time)
#What is the number of groups in the data?
n.groups <- length(unique(group))
#Use the number of ticks (plus one for t = 0) times the number of groups to
#create an empty skeleton of the results.
curves <- data.frame(tick = rep(0:max.time, n.groups), group = NA, surv.prop = NA)
#Add the group names - R will reuse the vector so that equal numbers of rows
#are labeled with each group.
curves$group <- unique(group)
#For each row, calculate the number of survivors in group[i] at tick[i]
for(i in seq_len(nrow(curves))){
curves$surv.prop[i] <- sum(surv.time[group %in% curves$group[i]] > curves$tick[i]) /
length(surv.time[group %in% curves$group[i]])
}
#Return the results, ordered by group and tick - easier for humans to read.
return(curves[order(curves$group, curves$tick), ])
}